ABSTRACT


Though  advances  in  VLSI  technology  will  soon  make  it  practical  to  construct 
parallel  processors  consisting  of  thousands  of  processing  elements  (PEs)  sharing  a 
central  memory,  the  performance  of  these  parallel  processors  is  limited  by  the  high 
memory  access  time  due  to  interconnect  network  latency.  This  thesis  is  a  study  of 
how  the  performance  of  a  parallel  processor  is  affected  by  associating  a  cache 
memory with each PE of the system. Cache parameters and policies are varied and
the performance of the resulting cache configurations is compared. The cache
coherence  problem  is  discussed  and  a  solution  that  is  compatible  with  the  philosophy 
of  parallel  systems  is  adopted. 

Performance  is  analyzed  by  analytic  and  simulation  models.  Due  to  time  and 
space  limitations  the  simulation  modeling  is  done  in  a  hierarchical  fashion:  a  pri- 
mary level  simulates  a  single  cache  and  a  secondary  level  simulates  a  parallel 
machine. The simulators can run in either a trace-driven or a self-driven mode. The trace
data  used  to  drive  the  simulators  was  collected  by  tracing  the  reference  patterns  of 
actual  parallel  programs.  An  approximate  analytic  model  is  developed  that  predicts 
the  queue  waiting  times  of  various  components  of  a  parallel  system,  enabling  the 
comparison  of  a  wider  range  of  cache  parameters  than  is  possible  with  the  simula- 
tors. 
CHAPTER 1


INTRODUCTION 


The advent of VLSI technology will soon make it practical to construct parallel
processors consisting of thousands of processing elements (PEs) sharing a central
memory. While ideal models allow each PE to access central memory in one cycle,
physical limitations prevent their realization (Schwartz [80]). Nevertheless, approximations
can be built by replacing the single cycle direct access of central memory
with an indirect access via a multi-cycle interconnection network. However, the
performance  of  such  parallel  processors  is  limited  by  the  high  memory  access  time 
due  to  network  latency.  Moreover,  as  network  traffic  increases,  conflicts  within  the 
network  cause  this  latency  to  increase. 

As in uniprocessor systems, where a memory hierarchy reduces the effective memory
access time, a memory hierarchy can be used in parallel processor systems that employ a
PE to central memory connection network; the hierarchy comprises a local
memory associated with each PE and a large central memory. The inclusion of a
local  memory  reduces  the  effective  memory  access  time  since  the  resulting  access 
time  is  an  average  of  the  network  and  local  memory  latencies.  Moreover,  since  the 
local  memory  services  a  percentage  of  all  memory  requests,  network  traffic  is 
diminished, thus reducing network latency. The indiscriminate use of a local
memory, however, can introduce memory coherence violations. A memory system
is coherent if the value returned from a load reflects the last value stored in that
address (by any PE).
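
The averaging argument above can be illustrated with a minimal sketch. The function name, the hit ratios, and the cycle counts are illustrative assumptions rather than values from this thesis, and the model deliberately ignores the secondary effect that reduced traffic lowers the network latency itself.

```python
def effective_access_time(hit_ratio, local_cycles, network_cycles):
    """Average access time when a fraction hit_ratio of requests is served
    by the local memory and the remainder traverse the network."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie in [0, 1]")
    return hit_ratio * local_cycles + (1.0 - hit_ratio) * network_cycles


# Hypothetical latencies: 1-cycle local memory, 20-cycle network access.
for h in (0.0, 0.5, 0.9, 0.99):
    print(h, effective_access_time(h, 1, 20))
```

Under these assumed latencies the average approaches the local memory latency only when the local memory services nearly all requests.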

A local memory can be implemented as a separately addressable memory or as
a cache. The use of a separately addressable memory imposes upon compilers and
loaders the onus of managing a two-level store. The use of a cache shifts the burden
of managing a two-level store from software to hardware. Also, since the execution
paths of a program may be unpredictable, large amounts of extraneous data
are likely to be loaded into a separately addressable local memory. A cache, on the
other hand, is a "demand" device, i.e. only the data needed by the processor is
brought into the cache and obsolete data is automatically evicted from the cache¹.
The effectiveness of a cache as a local memory for a uniprocessor environment has
been demonstrated in many simulation studies (e.g. Lee [69], Meade [72], Liptay
[76], Strecker [76], and Smith [82]). These simulation studies show that large
caches can capture the overwhelming majority of references to cacheable variables.

Design  specifications  for  caches  that  provide  high  performance  in  uniprocessor 
systems  are  well  known.  These  specifications  may,  however,  produce  unfavorable 
performance  levels  in  a  parallel  processor.  This  thesis  analyzes  the  standard  cache 
policies  to  determine  specifications  for  a  cache  that  will  provide  high  performance  in 
parallel  systems.  The  thesis  also  investigates  the  use  of  special  cache  functions  that 
are  intended  to  minimize  network  traffic  and  to  provide  a  capability  to  cache  shared 
read-write  variables. 


¹ Depending on the cache line size and program locality, some extraneous data may reside in the
cache.


The remainder of this chapter presents background material on caches and
cache policies used in uniprocessor systems plus cache organizations for some multiprocessor
systems. Chapter 2 presents the machine model, both software and
hardware, that is used in the analysis of cache memories for a parallel system.
Chapter 3 discusses cache parameters and policies and their effect on a parallel system.
In chapter 4 analytic models are derived for the machine model. Simulation
methods used in the analysis of caches are also presented in chapter 4. Results of
experiments studying various cache parameters and policies using the analytic
models and simulation models are presented in chapter 5.

1.1.    Background 

Historically,  large  disparities  between  processor  and  memory  speeds  limited 
the  development  of  high  performance  computer  systems.  Around  1960  the  concept 
of  a  memory  hierarchy  was  suggested  as  a  solution  to  the  memory-processor  speed 
differential.  The  purpose  of  a  memory  hierarchy  is  to  provide  large  memory  capa- 
city at  low  cost  per  bit  while  maintaining  an  access  rate  approximately  equal  to  pro- 
cessor speed.  The  hierarchy  consists  of  two  or  more  levels  of  interconnected 
memory  devices  —  the  lowest  level  being  a  small  fast  buffer  and  each  successive 
level  providing  higher  capacity  at  lower  cost  per  bit  and  larger  access  time.  The 
memory  is  logically  divided  into  pages  and  memory  management  facilities  transfer 
pages  between  levels  of  the  hierarchy  (using  hardware  and  software).  Typically  the 
processor receives data from only the lowest level in the hierarchy; if the data is not
resident  in  the  lowest  level,  it  must  be  transferred  from  a  higher  level. 


The  effectiveness  of  a  memory  hierarchy  is  derived  from  program  locality;  pro- 
gram locality  is  defined  by  two  criteria:  temporal  locality  and  spatial  locality  [Den- 
ning 72].  Temporal  locality  implies  that  recently  referenced  memory  locations  have 
a high probability of being referenced again in the near future. Spatial locality
implies  that  data  in  locations  surrounding  recently  referenced  locations  have  a  high 
probability  of  being  referenced  in  the  near  future.  The  memory  management  facili- 
ties rely  on  the  principles  of  program  locality  in  order  to  minimize  memory  access 
time.  Processor  requests  are  anticipated  by  transferring  page  sizes  of  greater  than 
one  datum  between  levels  of  the  hierarchy  and  maintaining  recently  used  pages  in 
the  lowest  levels  of  the  hierarchy  as  long  as  possible. 

For the remainder of this section a two-level memory hierarchy, cache and main
memory,  is  considered.  Management  of  this  hierarchy  is  performed  completely  by 
hardware.  Data  is  transferred  between  the  cache  and  main  memory  on  demand;  the 
amount  of  data  being  transferred  is  termed  a   cache  line   (or  block). 

1.1.1.    Cache  Operation 

The  management  and  functionality  of  a  cache  is  accomplished  using  a  data  store 
and  tag  store.  The  data  store  contains  data  comprising  cache  lines.  The  tag  store 
contains  the  addresses  (either  physical  or  virtual)  of  the  lines  in  the  data  store.  To 
maintain  the  association  between  addresses  and  data,  an  address  and  its  data  are 
kept  in  the  same  relative  position  in  their  respective  stores.  When  an  address  is 
presented to the cache the upper bits of the address, the line address, are used to check
the residency of a line in the cache. The low-order bits are used to select the portion
of the line requested by the processor. If the line is resident in the cache
(termed a cache hit²) the requested bytes are immediately transmitted to the processor
(assuming a load was requested). If the line is not resident (termed a cache
miss³) the address is passed to main memory in order to transfer the line to the
cache. When the line is transferred, it is stored in the data store, the line address is
stored in the tag store, and the requested bytes are transmitted to the processor.
Storing a new line in the cache may cause an eviction of a line already resident.

The  performance  of  a  cache  is  dependent  on  several  design  parameters.  The 
choice  of  a  design  is  not  simple  since  a  parameter  that  provides  optimal  perfor- 
mance may  not  be  cost  effective  or  may  be  undesirable  because  of  other  design  deci- 
sions. Discussed  below  are  some  of  the  parameters  that  affect  cache  performance; 
for a more detailed discussion of these parameters the reader is advised to see Smith
[82].

1.1.1.1.   Mapping  Method 

To  determine  the  residency  of  a  cache  line,  the  line  address  of  a  memory 
request  is  mapped  into  a  set  of  locations  in  the  cache  tag  store  and  is  compared  with 
the  tags  in  those  locations.  The  method  of  mapping  determines  the  placement  of  a 
line  in  the  cache.  There  are  three  significant  mapping  methods:  associative  map- 
ping, direct  mapping,  and  set-associative  mapping.  The  mapping  method  chosen  for 
a system has a small effect on performance (Kaplan and Winder [73]), but due to
levels of hardware complexity, it has a great effect on the cost of the cache. This is
especially true in medium-size systems like the VAX 11/780 and Data General
MV8000.

² The cache hit ratio (or simply hit ratio) is the percentage of processor requests satisfied by the
cache.

³ The cache miss ratio (or simply miss ratio) is the percentage of processor requests not satisfied
by the cache; miss ratio = 1 - hit ratio.

Associative mapping allows a cache line to reside in any position in the cache.
This requires the entire line address be kept in the tag store. When a processor
issues a memory request the line address of the request is compared in parallel with
each element in the tag store. To accomplish the parallel comparisons efficiently,
content-addressable memories (CAMs) are used. Associative mapping provides
optimal performance over other mapping methods, but the present cost and size of
CAMs do not make the scheme cost effective. However, advances in VLSI technology
may reduce the cost and increase chip capacity.

The  least  complex  method  is  direct  mapping.  Direct  mapping  uses  a  subfield 
of  the  line  address  as  an  index  into  a  single  position  in  the  tag  store  and  data  store, 
thus separating the lines into equivalence classes. The remaining bits of the line
address are stored in the tag store and used for residency comparisons with processor
requests. If a cache has the capacity to store $2^k$ lines and a line address is composed
of $n$ bits, the low-order $k$ bits specify the line position and the remaining
$n-k$ bits are stored for comparisons. This means that line addresses $i + j \cdot 2^k$,
$j = 0, 1, 2, \ldots$ are mapped into position $i$ of the stores (or equivalence class $i$).
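
The direct-mapped address decomposition can be sketched as follows. This is an illustration under stated assumptions: the function names and the example sizes (16-byte lines, an 8-line cache) are hypothetical, and addresses are treated as plain byte addresses.

```python
def is_power_of_two(x):
    return x > 0 and (x & (x - 1)) == 0


def split_address(addr, line_bytes, num_lines):
    """Split a byte address into (tag, index, offset) for a direct-mapped
    cache holding num_lines = 2**k lines of line_bytes bytes each."""
    if not (is_power_of_two(line_bytes) and is_power_of_two(num_lines)):
        raise ValueError("line_bytes and num_lines must be powers of two")
    offset = addr % line_bytes       # low-order bits: bytes within the line
    line_addr = addr // line_bytes   # upper bits: the line address
    index = line_addr % num_lines    # low-order k bits: position in the stores
    tag = line_addr // num_lines     # remaining n-k bits: kept in the tag store
    return tag, index, offset


# Line addresses 5, 13 and 21 differ by multiples of 2**3 = 8, so they all
# fall in equivalence class 5 and are told apart only by their tags.
for line_addr in (5, 13, 21):
    print(line_addr, split_address(line_addr * 16, 16, 8))
```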

Direct  mapping  uses  the  least  amount  of  hardware  in  both  mapping  and  in 
address  comparison.  Its  eviction  scheme  is  also  very  fast  since  only  one  line  of  a 
given  equivalence  class  may  reside  in  the  cache.  However,  this  limitation  is  the 
major disadvantage of direct mapping. If the processor consistently makes requests
for different line addresses that are mapped to the same equivalence class, the cache
performance  is  substantially  reduced  due  to  the  main  memory  traffic  generated  by 
the  continual  swapping  of  lines. 

An intermediate mapping method, in terms of hardware complexity, is set-associative
mapping, which performs close to optimal with little cost increase over
direct mapping. Set-associative mapping maps a line of a given equivalence class into
a set of cache positions, each set capable of containing several lines. For example, a
cache of size $2^k$ with a set size of $2^s$ uses the low-order $k-s$ bits of an $n$-bit line
address to specify a set, and the remaining $n-k+s$ bits are stored in the tag store.
Memory requests mapped into a set are compared (in parallel) only with elements
in that set. Simulations done by Kaplan and Winder [73] show that a set size of two
or four provides performance levels approximating associative mapping. Since this
mapping method provides good performance at low cost it is used in many uniprocessor
systems. (Direct mapping and associative mapping are special cases of set-associative
mapping: at a set size of one, set-associative degenerates into direct mapping; if
the number of sets equals one, set-associative degenerates into associative mapping.)
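
A small tag-store model illustrates how the three mapping methods relate. The class and function names are hypothetical, LRU eviction within a set is an assumption made for the sketch, and the alternating trace is a contrived worst case for direct mapping rather than data from this thesis.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-store model of a cache holding num_lines lines in sets of set_size."""

    def __init__(self, num_lines, set_size):
        if set_size < 1 or num_lines % set_size:
            raise ValueError("set_size must divide num_lines")
        self.set_size = set_size
        self.num_sets = num_lines // set_size
        # One ordered tag table per set; the least recently used tag is first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, line_addr):
        """Reference one line address; return True on a hit."""
        index = line_addr % self.num_sets   # low-order bits select the set
        tag = line_addr // self.num_sets    # remaining bits form the tag
        tags = self.sets[index]
        if tag in tags:
            tags.move_to_end(tag)           # now the most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(tags) == self.set_size:
            tags.popitem(last=False)        # evict the least recently used
        tags[tag] = True
        return False


def misses_for(set_size, trace, num_lines=8):
    cache = SetAssociativeCache(num_lines, set_size)
    for line_addr in trace:
        cache.access(line_addr)
    return cache.misses


# Lines 0 and 8 collide in an 8-line direct-mapped cache (set size 1).
trace = [0, 8] * 10
for set_size in (1, 2, 8):
    print(set_size, misses_for(set_size, trace))
```

A set size of 1 gives direct mapping and a set size equal to the number of lines gives associative mapping, so one model covers all three methods.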

1.1.1.2.   Memory  Update  Policy 

When  a  processor  issues  a  store  to  a  particular  memory  location,  the  store  must 
be  eventually  reflected  in  main  memory  since  the  cache  is  only  a  temporary  storage 
device.  There  are  two  methods  in  which  updates  can  be  accomplished:  store-in 
(store-back,  write-back,  copy-back,  or  swap)  and  store-through  (or  write-through). 
These methods and their variants are discussed in detail in Bell et al. [74] and Pohm
et al. [75].

Store-in  is  a  policy  in  which  updates  are  made  immediately  in  cache  and  are 
reflected in main memory at some later time. Typically, a store-allocate (or write-allocate)
policy is used in conjunction with store-in, i.e. a cache line is fetched from
main memory when the processor issues a (load or store) request and the location is
not resident in cache. Thus, theoretically, the frequency of main memory requests
will  equal  the  miss  ratio  (asymptotically  approaching  zero  as  the  cache  size 
increases).  For  a  finite  cache  size  the  frequency  of  main  memory  requests  is  likely 
to  be  larger  than  the  miss  ratio  since  a  miss  may  cause  a  line  to  be  evicted  from 
cache. 

The simplest form of store-in requires main memory to be updated with the
contents of an evicted line, with the update taking precedence over the fetch for the
missed line. Not all evicted lines need be written back, however, only those which
have been modified. A bit, termed a dirty bit, can be associated with each line and
set only if the line is modified. Upon eviction the dirty bit is examined and an
update is performed if and only if the bit is set, thus reducing the amount of
interference evictions cause on normal memory requests (Pohm et al. call this
scheme flagged swap). The amount of interference can be further reduced by satisfying
a miss prior to the memory update. Such a scheme necessitates a register to
hold the modified evicted line while the new line is being fetched (flagged register
swap).

Store-through is a policy in which main memory is immediately updated when a
processor issues a store regardless of the residency of the location in cache; the
cache is also updated if the location is resident. A no-store-allocate policy is typically
used with store-through, i.e. a location is brought into cache if and only if the
processor issues a read and the location is not resident in cache. This method has
the advantage of maintaining cache-memory coherence, i.e. main memory always
reflects the most recent processor store⁴. The disadvantage of store-through is that
it generates a larger number of main memory requests (than store-in): the fraction
of references to main memory asymptotically approaches the fraction of processor
stores as the cache size increases (instead of zero as in store-in). Furthermore, the
larger number of memory requests reduces processor utilization since the processor
must wait for the update to complete.
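
The difference in main memory traffic between the two update policies can be sketched by counting requests on a small trace. This is a simplified model under stated assumptions: the cache is direct mapped, store-in is paired with store-allocate and a dirty bit, store-through is paired with no-store-allocate and no store buffer, and the function name and traces are hypothetical.

```python
def memory_requests(trace, num_lines, policy):
    """Count main memory requests for a trace of (op, line_addr) pairs,
    where op is "load" or "store", on a direct-mapped cache."""
    tags = [None] * num_lines
    dirty = [False] * num_lines
    requests = 0
    for op, line_addr in trace:
        index, tag = line_addr % num_lines, line_addr // num_lines
        hit = tags[index] == tag
        if policy == "store-through":
            if op == "store":
                requests += 1            # every store updates main memory
            elif not hit:
                requests += 1            # only a load miss fetches a line
                tags[index] = tag
        elif policy == "store-in":
            if not hit:
                if tags[index] is not None and dirty[index]:
                    requests += 1        # write back the modified victim
                requests += 1            # fetch the missed line
                tags[index] = tag
                dirty[index] = False
            if op == "store":
                dirty[index] = True      # memory is updated on eviction
        else:
            raise ValueError("unknown policy: " + policy)
    return requests


# Repeated loads and stores to one resident line.
trace = [("load", 0), ("store", 0)] * 50
for policy in ("store-in", "store-through"):
    print(policy, memory_requests(trace, 4, policy))
```

With a one-line cache, the trace `[("store", 0), ("load", 1)]` produces three requests under store-in for only two misses, which illustrates why the request frequency can exceed the miss ratio in a finite cache.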

A  simple  enhancement  to  store-through,  which  improves  processor  utilization, 
is  to  buffer  the  store;  this  allows  the  processor  to  issue  a  store  and  continue  execu- 
tion while  the  update  is  taking  place,  waiting  only  if  a  cache  miss  occurs  or  a  second 
store  request  is  issued  before  the  first  is  completed.  Smith  [79]  shows  that  increas- 
ing the  store  buffer  size  to  four  words  provides  store-through  with  the  same  perfor- 
mance as  store-in.  (Note  that  the  use  of  such  a  buffer  no  longer  guarantees  cache- 
memory  coherence.) 


⁴ Maintaining cache-memory coherence provides some memory reliability and fault-tolerance; if
the cache fails due to parity errors in the data store, for example, the cache can be disabled and execution
can continue. Maintaining cache-memory coherence also simplifies direct memory access (DMA)
input/output.


1.1.1.3.  Eviction  Schemes 

The eviction schemes used in a cache system are similar to those used in paging
systems: least recently used (LRU), random selection, and first-in-first-out (FIFO).
The major difference is that, due to time restrictions, evictions in a cache system
must be handled completely by the hardware; a call to the operating system would
take longer than a single memory reference.

Simulations  have  shown  that  system  performance  is  relatively  insensitive  to 
the eviction method chosen (Gibson [67] and Strecker [76]); the gains that are provided
by the various schemes are usually outweighed by cost differentials. The
most complex method to implement, which also provides the best performance, is
LRU (generally used in large computer systems). Small computer systems tend to
use  random  selection  because  it  is  easiest  to  implement  and  performs  almost  as  well 
as  LRU. 
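
The three eviction schemes can be compared on a toy trace with the following sketch. It assumes a fully associative cache, uses a seeded pseudo-random generator so the random policy is repeatable, and the function name and trace are hypothetical; it makes no claim about the relative performance reported in the cited studies.

```python
import random


def count_misses(trace, capacity, policy, seed=0):
    """Count misses for a fully associative cache of `capacity` lines under
    the "LRU", "FIFO" or "random" eviction scheme."""
    if policy not in ("LRU", "FIFO", "random"):
        raise ValueError("unknown policy: " + policy)
    rng = random.Random(seed)
    resident = []                    # eviction candidates, oldest first
    misses = 0
    for line_addr in trace:
        if line_addr in resident:
            if policy == "LRU":      # a hit refreshes the line's age
                resident.remove(line_addr)
                resident.append(line_addr)
            continue
        misses += 1
        if len(resident) == capacity:
            if policy == "random":
                resident.pop(rng.randrange(capacity))
            else:                    # LRU and FIFO both evict the front
                resident.pop(0)
        resident.append(line_addr)
    return misses


trace = [1, 2, 1, 3, 1]
for policy in ("LRU", "FIFO", "random"):
    print(policy, count_misses(trace, 2, policy))
```

LRU and FIFO differ only in whether a hit reorders the candidates, which is the extra bookkeeping LRU hardware must maintain.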

1.1.1.4.  Line  Size 

For  a  given  cache  size,  varying  the  line  size  causes  greater  variance  in  cache 
performance  than  any  other  cache  parameter.  The  choice  of  a  line  size  is  not  easy 
and  depends  on  many  issues  (both  hardware  and  software);  for  example,  a  line  size 
providing  less  than  optimal  miss  ratio  may  be  chosen  because  of  memory  bandwidth 
restrictions. 

A large line size provides a "look-ahead" capability to the cache, enhancing
performance due to spatial locality of programs. A large line size will also cause
the cache to fill up faster (Easton and Fagin [78]), which is important if context
switches are frequent. Large line sizes also reduce the amount of bookkeeping: for
a fixed size cache, fewer lines can be stored, thus the size (number of entries) of the
tag store is reduced and so is the number of bits needed for replacement algorithms
and update policies.

Small  line  sizes  have  the  advantage  of  loading  into  cache  only  data  that  is  very 
likely  to  be  referenced.  A  large  line  size  can  pollute  the  cache  by  containing 
extraneous  data,  reducing  the  utilization  of  the  memory  device.  If  task  switches  are 
infrequent,  useful  information  is  less  likely  to  be  evicted  due  to  line  conflicts  when 
line  sizes  are  small.  Also,  small  line  size  requires  a  small  cache-main  memory  bus, 
reducing  cost. 

Smith [85] performed extensive simulations studying the relationship between
line size and miss ratio. The simulations were based on several machine architectures
and used trace data from a variety of applications. Smith found the following
relationships between line size and miss ratio: Let $m_i$ be the average miss ratio for
a cache line size $i$. Let $r_i$ be defined as

$$ r_i = \frac{m_i}{m_{i/2}} \qquad (1.1) $$


For a cache size of 32k bytes the given line sizes yielded the following ratios:

Line size (bytes)    16      32      64      128
Ratio                .580    .607    .621    .700
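
The tabulated ratios can be chained to estimate how the miss ratio changes over several doublings of the line size. The sketch below rests on an assumption about the definition: each ratio is taken to compare the miss ratio at a line size with the miss ratio at half that size, so the 8-byte line serves as the baseline. The function names are hypothetical, and the traffic estimate assumes every miss transfers exactly one full line.

```python
# Ratios from the table above for a 32k byte cache, keyed by line size in bytes.
RATIOS = {16: 0.580, 32: 0.607, 64: 0.621, 128: 0.700}


def relative_miss_ratio(line_size, ratios=RATIOS, base=8):
    """Miss ratio at line_size relative to the miss ratio at `base` bytes,
    assuming each ratio compares a line size with half that size."""
    if line_size != base and line_size not in ratios:
        raise ValueError("no ratio tabulated for this line size")
    relative = 1.0
    size = base
    while size < line_size:
        size *= 2
        relative *= ratios[size]
    return relative


def relative_traffic(line_size, ratios=RATIOS, base=8):
    """Bytes fetched relative to the baseline: misses times line length."""
    return relative_miss_ratio(line_size, ratios, base) * (line_size / base)


for size in (8, 16, 32, 64, 128):
    print(size, round(relative_miss_ratio(size), 4), round(relative_traffic(size), 4))
```

Because every tabulated ratio exceeds one half, each doubling of the line size lowers the miss ratio but raises the number of bytes transferred, which is one way memory bandwidth restrictions enter the choice of line size.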

1.1.2.    Caches  in  a  Multiprocessor  Environment 

A cache-main memory hierarchy in a multiprocessor system can be designed in
two basic organizations: A private cache organization associates a cache with each
processor, the caches being interconnected with main memory via a connection network
(see figure 1.1a). A shared cache organization associates a cache with each
memory  module  comprising  main  memory.  The  processors  and  caches  are  intercon- 
nected via  a  connection  network  (figure  1.1b).  The  following  subsection  discusses 
various  implementations  of  the  private  cache  organization  and  whether  the  imple- 
mentations can  support  large  numbers  of  processors.  Shared  cache  organizations 
are not discussed since for large parallel systems such an organization does not significantly
reduce the large latency introduced by the interconnection network. For
an example of a shared cache organization see Yeh [81].

[Figure 1.1: a) private caches; b) shared caches]

Another alternative to the design of a cache-main memory hierarchy in a multiprocessor
system is to employ a private cache and a shared cache. In this organization
the non-shared data of a PE is maintained in its private cache and shared data
is maintained in a shared cache accessible to all the PEs. Such an organization is
likely to support more PEs than a simple shared cache organization since the
request rate to the shared cache is only a fraction of the total request rate. Since
references to shared data must traverse the connection network, the number of PEs
that can be supported by such an organization is limited by the network latency
(which increases as the number of PEs increases). For an example of a "combined"
organization see Dubois and Briggs [81].

1.1.2.1.   Private  cache  organization 

Since  a  datum  may  reside  in  several  different  caches,  the  use  of  a  private  cache 
organization  for  multiprocessors  introduces  cache  coherence  problems.  Standard 
cache techniques used in uniprocessor systems, such as store-through, are inadequate
to ensure coherence in a multicache environment since modifications are
reflected only in main memory and not in all the caches. Suppose a PE holds a
word in its cache and another PE modifies that word; the first PE will be unaware of the
modification.

A  simple  solution  to  cache  coherence  is  to  allow  the  caches  to  broadcast  to  one 
another, over a common bus, the locations of modified lines. If a cache contains
one of these locations, it invalidates its corresponding directory entry. This solution
reduces performance since the primary function of a cache, reduction of memory
access time, is hampered by frequent and often needless interrupts generated by
other caches. More importantly, access to the bus is serial, limiting the performance.
Due to these weaknesses commercial implementations using this solution
have been limited to only two or four processors.
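
Both the stale-copy problem and the broadcast remedy can be shown in a minimal sketch. The class name is hypothetical, main memory is modeled as a dictionary, each cache holds single words rather than lines, the common bus is modeled as a plain list of caches, and store-through is assumed for simplicity.

```python
class PrivateCache:
    """Store-through private cache; if attached to a bus (a shared list of
    caches), every store is broadcast so other caches invalidate their copy."""

    def __init__(self, memory, bus=None):
        self.memory = memory
        self.words = {}                  # address -> cached value
        self.bus = bus
        if bus is not None:
            bus.append(self)

    def load(self, addr):
        if addr not in self.words:       # miss: fetch from main memory
            self.words[addr] = self.memory[addr]
        return self.words[addr]

    def store(self, addr, value):
        self.memory[addr] = value        # store-through to main memory
        if addr in self.words:
            self.words[addr] = value
        if self.bus is not None:
            for other in self.bus:       # serial broadcast over the bus
                if other is not self:
                    other.words.pop(addr, None)


def read_after_remote_store(use_bus):
    memory = {0: 1}
    bus = [] if use_bus else None
    first, second = PrivateCache(memory, bus), PrivateCache(memory, bus)
    first.load(0)                        # first PE caches the word
    second.store(0, 2)                   # second PE modifies it
    return first.load(0)                 # value the first PE now observes


print("without broadcast:", read_after_remote_store(False))
print("with broadcast:", read_after_remote_store(True))
```

The loop in `store` visits every other cache on each store, which mirrors the needless interruptions and the serial bus access described above.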

Tang  [76]  and  Censier  and  Feautrier  [78]  proposed  two  different  caching 
methods  for  maintaining  coherence  by  marking  a  line  as  either  shared  or  private  and 
requiring  that  a  line  be  private  before  being  modified.  A  line  is  private  if  only  one 
cache  can  have  a  copy  of  a  line  in  its  data  store.  A  line  is  shared  if  more  than  one 
cache  can  have  a  copy  of  a  line  in  their  data  stores.  The  two  methods  differ  in  the 
way  that  bookkeeping  is  done. 

Tang  suggests  the  use  of  a  central  cache  controller  to  coordinate  cache 
requests.  This  controller  would  contain  a  directory  of  the  line  addresses  resident  in 
all  the  caches,  with  each  address  marked  as  shared  or  private.  All  cache  requests 
pass  through  the  controller,  which  accepts  or  rejects  a  request,  updating  its  directory 
after each transaction. When a processor wants to modify a cache line, the store is
granted provided the line is private. If the line is shared, the cache must issue a
private request to the controller. The controller, in turn, requests all caches containing
the shared line to remove it from their directories.

Censier and Feautrier suggest a more distributed method of coordination, each
memory module being responsible for granting or denying access to its lines. Bookkeeping
is maintained by the use of present flags. A set of flags is associated with
each line where the number of flags equals the number of caches. The k-th flag
represents the k-th cache and is set to 1 if the corresponding line resides in the k-th
cache. Using these flags, the memory module can determine if a line is free (i.e.
not residing in any cache) or if some caches must be notified prior to declaring a
line private (necessary for updating the line). (This caching scheme has been adopted
for the S-1, Widdoes [79].)
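
The bookkeeping performed with present flags can be sketched as a small directory. This is a simplification under stated assumptions: the class and method names are hypothetical, the flags for each line are packed into one integer, and the write-back that a real memory module would require before a privately held line is shared again is omitted.

```python
class PresentFlagDirectory:
    """Per-line present flags as kept by a memory module: bit k of a line's
    flag vector is 1 when cache k holds a copy of that line."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.flags = {}          # line address -> integer bit vector
        self.private = set()     # lines currently declared private

    def holders(self, line):
        bits = self.flags.get(line, 0)
        return [k for k in range(self.num_caches) if (bits >> k) & 1]

    def is_free(self, line):
        return self.flags.get(line, 0) == 0

    def record_fetch(self, line, k):
        """Cache k obtains a shared copy (write-back of a private copy omitted)."""
        self.flags[line] = self.flags.get(line, 0) | (1 << k)
        self.private.discard(line)

    def make_private(self, line, k):
        """Declare the line private to cache k and return the caches that
        must first be notified to give up their copies."""
        to_notify = [c for c in self.holders(line) if c != k]
        self.flags[line] = 1 << k
        self.private.add(line)
        return to_notify


directory = PresentFlagDirectory(num_caches=4)
directory.record_fetch(7, 0)
directory.record_fetch(7, 2)
print(directory.make_private(7, 2))   # caches to notify before the update
```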

The  above  methods  of  cache  management  can  be  effective  in  small  multiproces- 
sor systems;  they  are,  however,  not  viable  for  large  parallel  processing  systems.  A 
central  controller  is  not  feasible  because  of  fan-in  limitations  and  serialization  penal- 
ties: Since  all  memory  requests  not  satisfied  by  a  cache  pass  through  the  central 
controller,  a  fan-in  network  is  necessary.  Network  delays  and  conflicts  limit 
memory  throughput  and,  more  importantly,  introduce  serialization,  thus  reducing 
system  performance.  Serialization  is  also  introduced  during  memory  updating.  For 
example, if many PEs are each updating one point in a matrix, groups of PEs must
wait while others request a shared line to be made private (necessary for updating),
thus causing a serial bottleneck. The line size can be reduced in an attempt to limit
the  problem,  but  small  line  sizes  increase  the  complexity  of  the  controller  and 
reduce  the  lookahead  capabilities  of  the  cache. 

The distributed method of cache coordination suggested by Censier and Feautrier
is not viable in large parallel systems because of memory requirements and
serialization penalties. Since the number of present flags associated with a line is
equal to the number of caches (i.e. processors), very large bit vectors are required;
for a medium scale parallel processor, say 64 PEs, eight bytes of bit vectors are
required for each line. For a typical line size, 32 bytes, 20% of memory is used for
cache bookkeeping. Censier and Feautrier suggest that for large systems the line
size should be increased (to limit the storage overhead of the present flags), but
large line sizes introduce serialization during memory updating as indicated above.
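
The storage overhead of the present flags follows from simple arithmetic, sketched below. The function name is hypothetical, one flag bit per cache is assumed, and the overhead is expressed as the share of the combined storage (line data plus flags) occupied by the flags.

```python
def present_flag_overhead(num_caches, line_bytes):
    """Fraction of storage (line data plus flags) occupied by present flags,
    with one flag bit per cache for every line."""
    flag_bytes = num_caches / 8
    return flag_bytes / (line_bytes + flag_bytes)


# 64 PEs need 8 bytes of flags per 32-byte line: 8 / (32 + 8).
print(present_flag_overhead(64, 32))

# The overhead for a hypothetical 1024-PE machine at several line sizes.
for line_bytes in (32, 128, 512):
    print(line_bytes, round(present_flag_overhead(1024, line_bytes), 3))
```

For a fixed line size the flag storage grows linearly with the number of PEs, which is why the scheme pushes toward larger lines as the machine grows.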

Archibald  and  Baer  [84]  proposed  a  cache  coherence  solution  based  on  the 
work  by  Censier  and  Feautrier.  The  Archibald  and  Baer  solution  replaces  the 
present  flags  by  two  bits,  thus  reducing  the  amount  of  bookkeeping  overhead.  The 
solution, however, still suffers from serialization due to the necessity of exclusive
access  of  a  cache  line  before  a  modification  can  take  place. 

Goodman [83] suggests a caching scheme in which the cache-main memory connection
network is a bus and each cache monitors bus activity. The scheme is similar
to the simple caching scheme discussed above in that a single cache services its
processor requests, monitors bus activity, and suffers from a serial bottleneck, the
common bus. The scheme minimizes the amount of bus traffic by using a memory
update policy called write-once: when a line is first modified a cache issues a store
to main memory and other caches containing the line invalidate it in their directories;
subsequent stores to the line are done without memory updates. If a cache
needs to access a non-resident line it issues a main memory request; if another cache
contains a modified copy of the line it services the request, otherwise the request is
serviced by main memory.
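
The write-once policy can be sketched as a per-line state machine that counts bus transactions. The four state names follow the usual description of Goodman's scheme; everything else is an assumption made for illustration: the class and method names are hypothetical, a store to a non-resident line is modeled as a fetch followed by the write-once store, evictions are ignored, and data values are not tracked.

```python
INVALID, VALID, RESERVED, DIRTY = "invalid", "valid", "reserved", "dirty"


class WriteOnceBus:
    """Tracks the state of each line in each cache and counts bus transactions."""

    def __init__(self, num_caches):
        self.states = [dict() for _ in range(num_caches)]
        self.bus_transactions = 0

    def state_of(self, k, line):
        return self.states[k].get(line, INVALID)

    def _invalidate_others(self, k, line):
        for other, table in enumerate(self.states):
            if other != k:
                table.pop(line, None)    # monitoring caches drop their copy

    def load(self, k, line):
        if self.state_of(k, line) != INVALID:
            return                       # hit: no bus activity
        self.bus_transactions += 1       # miss: request placed on the bus
        for other, table in enumerate(self.states):
            if other != k and table.get(line) in (RESERVED, DIRTY):
                table[line] = VALID      # the owner's copy becomes shared
        self.states[k][line] = VALID

    def store(self, k, line):
        if self.state_of(k, line) == INVALID:
            self.load(k, line)           # assumption: fetch before modifying
        if self.state_of(k, line) == VALID:
            self.bus_transactions += 1   # first store is written to memory
            self._invalidate_others(k, line)
            self.states[k][line] = RESERVED
        else:
            self.states[k][line] = DIRTY # later stores stay in the cache


bus = WriteOnceBus(num_caches=2)
bus.load(0, 5)
bus.load(1, 5)
for _ in range(3):
    bus.store(0, 5)
print(bus.bus_transactions, bus.state_of(0, 5), bus.state_of(1, 5))
```

Only the first of the three stores uses the bus; every transaction that does occur is still serialized on the single bus, which is the bottleneck noted above.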

Goodman  demonstrates  with  simulations  that  the  write-once  update  policy  does 
minimize  network  traffic,  thus  making  it  viable  for  small  parallel  systems.  How- 
ever, the  scheme  is  not  viable  for   large  numbers   of  PEs  because   of  the   serial 
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bottleneck  generated  by  the  common  bus.  Rudolph  and  Segall  [83],  Prapamarcos 
and  Patel  [84],  and  Katz  et.  al.  [85]  have  proposed  extensions  of  the  Goodman  cach- 
ing scheme;  their  schemes  reduce  the  number  of  cache  invalidations  and  the  amount 
of  bus  traffic.  Rudolph  and  Segall  also  proposed  a  multiple  shared-bus  configura- 
tion where  the  caches  and  the  shared  memory  are  divided  into  banks  using  the  least 
significant  address  bits;  each  bank  shares  the  same  bus.  Using  such  an  organization 
they claim a processor can be constructed having as many as 32 to 256 PEs. Building
a processor with more PEs is not possible because of hardware complexity and
packaging  limitations.  An  alternative  connection  network  is  not  possible  since  each 
cache  must  monitor  all  network  traffic. 

None  of  the  cache  coherence  solutions  discussed  above  is  adequate  for  a  highly 
parallel  processor.  The  solutions  do  support  small  numbers  of  processors,  the 
number  being  dependent  on  the  solution  technique.  However,  due  to  the  serializa- 
tion problems  the  solutions  do  not  scale  to  large  numbers  of  processors. 


CHAPTER   2 


MACHINE  MODEL 


2.1.   Parallel  Program  Structure 

A  parallel  program  consists  of  many  computational  units,  subsets  of  which  exe- 
cute concurrently,  coordinate,  and  share  data.  Since  a  single  program  may  consist 
of  diverse  subprograms  each  requiring  different  amounts  of  concurrency,  coordina- 
tion, and  shared  data,  characterization  of  parallel  programs  is  not  easy.  However, 
it  is  possible  to  define  a  program  space  in  which  parallel  programs  exist  (Jones  and 
Schwartz  [80]  and  Kung  [80]).  The  size  of  the  computational  unit  (granularity), 
method  of  coordination,  and  data  reference  patterns  define  such  a  program  space. 
Particular  architectures  may  be  more  suitable  for  certain  subsets  of  the  program 
space,  i.e.  realizing  a  higher  degree  of  hardware  utilization  and  program  speedup. 
This  thesis  is  concerned  with  the  subset  suitable  for  MIMD  machines. 

The size of a computational unit suitable for a MIMD machine is a sequence of
high-level language statements comprising a process (Horning and Randell [73]). A
unit may be a procedure, executed in parallel with other such units via a FORK
(Conway [63] and Anderson [65]) or PARBEGIN (Dijkstra [68]); or a unit can
represent  one  or  more  iterations  of  an  iterative  sequence  of  statements,  several  such 
units  being  executed  in  parallel  via  a  FORALL  (Davies  [81])  or  DOALL 
(Lundstrom  and  Barnes  [80])  (earlier  work  on  parallel  loop  constructs  was  done  by 
Gosden [66] and Draughon et al. [67]). Since the MIMD machine consists of
independent PEs, units executed in parallel may consist of homogeneous or heterogeneous
sequences of statements. The operating system scheduler (whether central
or distributed) obtains a process from the ready-to-run queue and initiates its execution
on a PE. The PE executes the process until it is blocked or terminates.

Computational  units  may  be  completely  independent  or  communicate  through 
shared data. The frequency of communication and the amount of data being shared
determine the level of parallelism and speedup. Frequent synchronization limits
the size of a computational unit (Cytron [85]) and/or causes a computational unit to
enter  a  blocked  state  (or  busy  wait)  many  times.  Both  these  conditions  reduce 
overall  performance  due  to  operating  system  overhead. 

Data  reference  patterns  strongly  influence  the  performance  of  a  computational 
unit.  A  unit  that  references  a  high  percentage  of  private  data  is  likely  to  execute  at 
PE  speed  since  private  data  can  be  maintained  in  local  memory  (either  cache  or 
separately  addressable  local  memory).  If  a  unit  references  a  lot  of  shared  data  per- 
formance is  reduced  since  shared  data  must,  in  general,  be  kept  in  shared  memory. 
Locality  also  affects  performance:  a  high  degree  of  temporal  locality  provides  a  high 
hit  rate  on  local  data  and  if  data  is  referenced  in  regular  patterns  (for  example  using 
do-loop  variables  as  indices)  prefetching  can  be  used  to  reduce  shared  memory 
latency; however, indiscriminate use of prefetching can cause a program to generate
incorrect results (see subsection 2.2.1).

2.2.   Machine  Architecture

The analysis of cache performance presented in this thesis is based on an architecture
that can be broadly classified as an MIMD machine composed of N = 2
autonomous PEs sharing a central memory, the central memory being composed of
N individual memory modules (MMs), which can be independently accessed. The
PEs  and  the  MMs  are  interconnected  by  a  buffered  multistage  packet  switching  net- 
work. The  topology  of  the  network  allows  any  PE  to  access  any  MM.  A  processor 
network  interface  (PNI)  interfaces  a  PE  with  the  network  and  a  memory  network 
interface  (MNI)  interfaces  a  MM  with  the  network.  (See  Gottlieb  et  al.  [83]  for  a 
detailed  description  of  such  an  architecture.) 

The  actual  geometry  of  the  PE-MM  interconnection  network  is  not  important  to 
the  analysis,  but  the  following  aspects  of  the  network  are  important: 

i)  The  depth  of  the  network  (number  of  stages)  is  logarithmic  in  the 
number  of  PEs. 

ii)  The  network  is  capable  of  pipelining  requests,  i.e.  the  delay 
between  packets  equals  the  switch  cycle  time  not  the  network  transit 
time. 

iii)  The network is packet switched, i.e. the switch settings are not
maintained  while  a  reply  is  awaited. 

iv)  The  network  is  buffered. 
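
Properties i) and ii) can be illustrated with a short calculation. The sketch below is a hypothetical model, not part of the analysis in this thesis: it assumes a network built from k-by-k switches, ignores queueing and contention entirely, and the function names are invented for illustration.

```python
def num_stages(n_pes, k):
    """Smallest number of stages of k-by-k switches that reaches n_pes ports."""
    stages, reach = 0, 1
    while reach < n_pes:
        reach *= k
        stages += 1
    return stages


def pipelined_time(requests, stages, switch_cycle):
    """Time for a burst of requests to cross an empty pipelined network.

    The first packet needs one switch cycle per stage; each later packet
    follows one switch cycle behind the previous one.
    """
    if requests == 0:
        return 0
    return stages * switch_cycle + (requests - 1) * switch_cycle


def unpipelined_time(requests, stages, switch_cycle):
    """Same burst if each request waited for the previous one to arrive."""
    return requests * stages * switch_cycle
```

For example, under these assumptions a 4096-PE network of 2-by-2 switches has 12 stages, and a burst of 10 requests crosses it in 21 switch cycles when pipelined, against 120 if the requests were sent one at a time.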

An  individual  switch  has  k  input  ports  and  k  output  ports  -  a  single  input  or  output 
port  is  capable  of  handling  an  n  bit  packet.  A  packet  received  at  an  input  port  is 
routed to an output port according to its main memory address. Associated with the
output  port  is  a  queue  that  enables  concurrent  insertions  of  packets  being  routed  to 
the output port.

The PNI performs four functions: virtual to physical address translation, cache
management, assembly/disassembly of network requests, and enforcement of network
pipelining policies (see below). The PNI enables the processor to issue
memory  requests  without  placing  the  processor  in  a  wait  state  --  issuing  a  request 
before  the  previous  one  is  acknowledged  is  termed  request  pipelining.  Since  the 
processor  may  issue  requests  faster  than  the  PNI  can  transmit  a  request  onto  the 
network,  the  PNI  maintains  a  queue  of  processor  requests. 

The  MNI  performs  two  functions:  assembly/disassembly  of  network  requests 
and  support  of  fetch-and-phi  operations  (Gottlieb  and  Kruskal  [82]).  The  MNI  also 
maintains  two  queues:  The  first  queue  is  used  to  maintain  requests  received  from 
the  network  that  are  pending  memory  service.  The  second  queue  maintains  requests 
that  have  been  serviced  by  the  MM  and  are  waiting  to  be  transmitted  over  the  net- 
work. 

2.2.1.   Storage  Classes 

The  machine  architecture  supports  three  basic  storage  classes:  instructions, 
private  data,  and  shared  data.  A  datum  is  considered  private  if  it  is  accessed  by 
only  a  single  process  during  the  datum's  lifetime.  A  datum  is  considered  shared  if  it 
may be accessed by several processes. Instructions and data are resident in the central
memory and (under certain restrictions) are allowed to migrate to the caches.

Because  of  the  asynchronous  behavior  of  MIMD  machines,  the  membership  of 
a  datum  in  a  particular  storage  class  has  a  large  effect  on  the  performance  of  a 
parallel program since architectural restrictions must be placed on the way a datum
is accessed. The restrictions are necessary in order to ensure that a computation is
sequentially consistent (Lamport [79]). Maintaining sequential consistency implies
that a set of concurrently executing processes comprising a program
produces the same results as if the processes were run in some arbitrary sequential
order. Collier [81] lists three sufficient conditions for maintaining sequential consistency:
uniprocessor order, coherence, and program order.¹ Uniprocessor order
requires  that  loads  and  stores  to  the  same  location  generated  by  a  PE  occur  in  the 
order  determined  by  the  program  being  executed  on  the  PE.  Consistency  requires 
that  coherence  is  maintained  in  the  memory  hierarchy.  Program  order  requires 
that  loads  and  stores  to  shared  data  occur  in  the  order  determined  by  the  program. 

Uniprocessor order and consistency alone are not sufficient to ensure sequential
consistency. In example 2.1, PE1 can issue the two stores in reverse order
without  violating  uniprocessor  order,  but  reversing  the  stores  violates  sequential 
consistency.  Program  order  requires  that  loads  and  stores  not  only  be  issued  in  the 
order  determined  by  program  execution,  but  actually  be  serviced  by  a  MM  in  the 
order determined by program execution. Since access to central memory is via a
buffered connection network, queuing delays within the network could reverse the
order in which two requests to separate MMs are serviced, causing sequential inconsistency.
Though PE1 may issue the store to A before issuing the store to X, network
delays may cause X to be serviced by a MM before A. Furthermore, the loads
of X and A by PE2 may occur before the store to A reaches the MM. Thus, the

¹ An architecture does not have to strictly adhere to these restrictions; however, it must produce
the  same  results  as  if  it  did  adhere  to  the  restrictions. 


Two computational units, executing on PE1 and PE2 respectively, are communicating
via a shared variable A. X is a semaphore to indicate valid data in A. Assume A
and X are in separate MMs and are initially zero.

        PE1                 PE2

        A := 1              loop:  if X = 0 goto loop
        X := 1              h := A

Sequential consistency is violated if at termination of these statements h = 0.

Example 2.1


requirement  of  program  order  can  limit  the  performance  of  a  MIMD  machine  since 
two requests to shared data cannot necessarily be pipelined. Instructions² and
read-only  data  can  be  pipelined  without  violating  sequential  consistency.  Private 
read/write  data  can  be  pipelined  without  violating  uniprocessor  order  provided  net- 
work queues  and  MM  queues  are  serviced  in  a    FIFO  fashion. 
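
The effect of pipelining the two stores of example 2.1 can be checked by enumerating the orders in which the MMs might service the requests. This is a minimal sketch under simplifying assumptions that are not in the text: only the load of X that releases PE2 from its loop is modeled, PE2's two loads are kept in program order, and the event names are hypothetical.

```python
from itertools import permutations

EVENTS = ("store A", "store X", "load X", "load A")


def outcomes(enforce_program_order):
    """Set of final values of h over all admissible service orders."""
    results = set()
    for order in permutations(EVENTS):
        pos = {event: i for i, event in enumerate(order)}
        # PE2 leaves its loop only on a load of X serviced after the
        # store to X, and it issues the load of A only after that.
        if not (pos["store X"] < pos["load X"] < pos["load A"]):
            continue
        # With program order enforced, PE1's stores are serviced in
        # the order issued; otherwise the network may reorder them.
        if enforce_program_order and not pos["store A"] < pos["store X"]:
            continue
        memory = {"A": 0, "X": 0}
        h = None
        for event in order:
            op, var = event.split()
            if op == "store":
                memory[var] = 1
            elif var == "A":
                h = memory["A"]
        results.add(h)
    return results
```

With program order enforced the only outcome is h = 1; when the stores may be serviced in either order, h = 0 also becomes possible, which is the violation described in example 2.1.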

The  negative  impact  of  maintaining  sequential  consistency  on  performance  can 
be  reduced  in  two  ways.  First,  the  latency  of  a  request  to  a  shared  datum  can  be 
hidden  to  some  degree  by  prefetching  the  shared  datum  before  it  is  used,  provided 
only  private  data  and  instructions  are  referenced  between  the  point  of  prefetching 
and  the  actual  reference  point.  Second,  the  amount  of  forced  serialization  can  be 
minimized:  Dependency  relationships  between  shared  data  can  be  generated  to 
determine when serialization must be enforced (Snir [85]). To enforce the serialization
a parallel processor could provide a serialization instruction for shared variables
(Brantley et al. [85]).


² Instructions are assumed to be non-self-modifying.

2.2.2.   Coherence  and  Cacheable  Data

As described in section 1.1.2, a parallel system employing a cache-main
memory  hierarchy  introduces  memory  coherence  problems.  Maintaining  coherence 
in  the  memory  hierarchy  incurs  performance  penalties.  The  simplest  scheme  for 
ensuring  coherence  is  to  prohibit  shared  read/write  data  from  residing  in  cache. 
Private data and instructions may be placed into a cache without coherence violations.
If a process is allowed to migrate to another PE, facilities must be present to
ensure that main memory is updated and the contents of the cache are invalidated
before the process migrates. Such a caching scheme greatly impacts performance
since all references to shared data must traverse the interconnection network. More
sophisticated schemes that minimize the penalty of maintaining coherence in the
memory hierarchy are discussed below.

Two different solution techniques for ensuring coherence in the memory
hierarchy are run-time coherence checks and compile-time coherence checks. Both
techniques  ensure  that  coherence  is  maintained  by  enforcing  the  following  two  con- 
ditions: 

i)    a  datum  is  permitted  to  be  resident  in  more  than  one  cache  pro- 
vided it  is  accessed  as  read-only. 

ii)   mutual exclusion must be maintained on the residency of a
shared datum when the datum is modified.

Note  that  the  above  conditions  are  equivalent  to  the  conditions  for  coordination  in 
the reader/writer problem (Courtois et al. [71]).

Run-time  coherence  checks  can  be  performed  by  hardware  in  a  centralized  or 
distributed fashion (see section 1.1.2.1 for a discussion of some implementations of
run-time checks). Maintaining coherence by hardware introduces serialization
inherent in the reader/writer problem. How the serialization is manifested is dependent
on the hardware implementation (see Yen et al. [85]). The amount of serialization
is dependent on the cache line size and number of processes modifying data
in  the  line.  The  dependency  on  the  cache  line  size  is  due  to  the  cache  maintaining 
residency  of  data  on  a  line  basis;  thus,  if  a  datum  is  to  be  modified,  mutual  exclu- 
sion must  be  maintained  for  the  entire  cache  line.  To  minimize  the  amount  of  seri- 
alization a  small  line  size  should  be  chosen;  however,  a  small  line  size  can  greatly 
increase  bookkeeping  costs. 
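
The trade-off between serialization and bookkeeping can be made concrete with a small sketch. The code below is illustrative only: it assumes word addresses, a hypothetical mapping from process names to the addresses each process modifies, and one directory entry per cache line.

```python
from collections import defaultdict


def contended_lines(writes_by_process, line_size):
    """Lines modified by more than one process (mutual exclusion needed)."""
    writers = defaultdict(set)
    for process, addresses in writes_by_process.items():
        for address in addresses:
            writers[address // line_size].add(process)
    return {line for line, procs in writers.items() if len(procs) > 1}


def directory_entries(cache_words, line_size):
    """Number of lines, and hence directory entries, in the cache."""
    return cache_words // line_size
```

For four processes that each modify a different word at addresses 0 to 3, a line size of one word gives no contended lines, while a line size of four words places all four writers on line 0. In the other direction, a 1024-word cache needs 1024 directory entries with one-word lines but only 256 with four-word lines.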

Compile-time  coherence  checks  tag  a  datum,  during  compilation,  as  either 
cacheable  or  non-cacheable.  The  cacheability  tags  may  be  maintained  on  a  segment 
basis  or  a  machine  instruction  may  distinguish  individual  references  as  cacheable  or 
non-cacheable.  Using  the  conditions  stated  above  for  maintaining  coherence  a 
cacheable  datum  can  be  defined  as  a  datum  that  is  accessed  as  read-only  by  one  or 
more  processors  or  if  modified  is  accessed  by  only  one  processor  during  a  computa- 
tional unit. 

To determine if a datum is cacheable two sets are defined for each computational
unit, $C_i$, to be executed in parallel:

•  Read Set. $R(C_i)$ is the set of data read, but not modified, by $C_i$.

•  Write Set. $W(C_i)$ is the set of data modified by $C_i$.

(These sets can be generated at compile time using global and interprocedural data
flow analysis techniques.) A datum, $x$, is cacheable if the following condition
holds for all computational units $C_i$ and $C_j$, $i \neq j$, to be executed in parallel:

$$\text{if } x \in R(C_i) \text{ or } x \in W(C_i) \text{ then } x \notin W(C_j) \qquad (2.1)$$

From the above condition it is clear that private data can be tagged as cacheable
since private data is accessible by only one computational unit. Instructions and
shared read-only data can also be tagged as cacheable since the data is not contained
in a write set for any $C_i$.

For an aggregate, each element must be checked independently; thus for a vector
V, V(k) can be tagged as cacheable if the above condition is satisfied for each
element  of  V.  Consider  example  2.2:  each  element  of  arrays  A  and  B  can  be  cached 
since  they  are  read-only.  The  elements  of  D  can  be  cached  since  each  element  is 
accessed  by  only  one  computational  unit;  upon  termination  of  the  computational 
units, main memory must reflect the last value stored into each array element.³

Though  compile-time  coherence  checks  are  static,  i.e.  the  cacheable/non- 
cacheable  attribute  is  determined  during  compilation,  the  attributes  may  change 
between  computational  units.    As  stated  in  section  2.1  a  single  parallel  program  may 

Matrix Product

    Doall i = 1 to N
        do j = 1 to N
            do k = 1 to N
                D(i,k) = D(i,k) + A(i,j)*B(j,k)

Example 2.2


³ Since several elements of D may be in a single cache line, multiple dirty bits are required for each
cache line. See section 3.5 for details.

consist of several parallelized computational units. Between two such units the
usage  of  a  datum  may  change,  for  example  from  read-write  to  read-only.  Thus  dur- 
ing the  second  computational  unit  the  datum  can  be  cached. 
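
Condition (2.1) translates directly into a check over the read and write sets. The sketch below is an illustration rather than a compiler algorithm: the sets are given explicitly instead of being derived by data flow analysis, array elements are represented as hypothetical tuples such as ("D", i, k), and the synchronization variable "S" is invented for the example.

```python
def is_cacheable(x, read_sets, write_sets):
    """Condition (2.1): if any unit reads or modifies x, no other unit modifies it."""
    n = len(read_sets)
    for i in range(n):
        if x in read_sets[i] or x in write_sets[i]:
            if any(x in write_sets[j] for j in range(n) if j != i):
                return False
    return True


def matrix_product_sets(n):
    """Read and write sets for the n units of the Doall in example 2.2.

    Unit i reads row i of A and all of B, and modifies row i of D.
    Elements of D are read and modified, so they belong to the write
    set only (the read set holds data read but not modified).
    """
    read_sets, write_sets = [], []
    for i in range(1, n + 1):
        reads = {("A", i, j) for j in range(1, n + 1)}
        reads |= {("B", j, k) for j in range(1, n + 1) for k in range(1, n + 1)}
        writes = {("D", i, k) for k in range(1, n + 1)}
        read_sets.append(reads)
        write_sets.append(writes)
    return read_sets, write_sets
```

Applied to the sets of example 2.2, every element of A, B, and D passes the check, whereas a variable modified by every unit, such as a synchronization variable, fails it.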

In  the  above  discussion  no  distinction  was  made  about  the  storage  class  of  a 
datum; therefore, a datum can be cached regardless of its membership in a particular
storage class provided it meets condition (2.1). Thus, a refinement of the shared
data storage class into shared cacheable and shared non-cacheable is necessary. The
elements  of  the  arrays  A,  B,  and  D  of  example  2.2  are  examples  of  shared  cache- 
able  data.  A  synchronization  variable  is  an  example  of  a  shared  non-cacheable 
datum.  The  underlying  storage  class  of  shared  cacheable  and  shared  non-cacheable 
is shared; the refinement to cacheable and non-cacheable specifies how a datum can
be  treated  within  the  memory  hierarchy.  The  distinction  between  private  and 
shared  cacheable  is  necessary  because  at  process  termination  shared  cacheable  data 
resident  in  a  cache  may  require  special  treatment  (see  below  and  section  3.5). 

In  the  previous  subsection  (2.2.1)  request  pipelining  was  stated  to  be  applicable 
for  instructions,  private  data,  and  shared  read-only  data.  The  class  of  data  that  can 
be  pipelined  can  now  be  extended  to  include  all  cacheable  data:  For  a  datum  to  be 
cacheable  condition  (2.1)  must  hold;  this  condition  requires  that  a  datum  be  read- 
only by  one  or  more  computational  units  or  if  modified  accessed  by  only  one  unit  - 
since  a  unit  has  exclusive  access  to  the  datum,  the  datum  can  be  considered  (tem- 
porarily) private.  Since  read-only  and  private  data  can  be  pipelined,  cacheable 
implies pipelineable.

Condition (2.1) can be extended by considering the conditions stated by Bernstein
[66] to ensure that a set of computational units maintains sequential consistency:
A set of computational units can be executed in parallel if the following
conditions hold for each pair of computational units $C_i$ and $C_j$, for all $i \neq j$:

i)   $R(C_i) \cap W(C_j) = \emptyset$

ii)  $W(C_i) \cap W(C_j) = \emptyset$          (2.2)

Note that these conditions imply that condition (2.1) holds for all x (in all the computational
units). Thus, if conditions (2.2) hold then all data accessed by a unit is
cacheable.  At  the  termination  of  a  computational  unit  all  shared  cacheable  data 
must  be  evicted  from  the  cache  and  central  memory  must  reflect  the  last  value 
stored  into  each  shared  location. 
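
A pairwise test in the spirit of Bernstein's conditions can be sketched as follows. The formulation used here (no unit reads what another modifies, and no two units modify the same datum) is the usual statement of those conditions and is an assumption of this sketch; the function name and the example sets are hypothetical.

```python
from itertools import combinations


def can_run_in_parallel(read_sets, write_sets):
    """True if every pair of units has disjoint read/write and write/write sets."""
    for i, j in combinations(range(len(read_sets)), 2):
        # One unit must not read data that the other modifies.
        if read_sets[i] & write_sets[j] or read_sets[j] & write_sets[i]:
            return False
        # Two units must not modify the same datum.
        if write_sets[i] & write_sets[j]:
            return False
    return True
```

Two units that share only read-only data pass the test; adding a datum that one unit modifies and the other reads, or that both modify, makes it fail.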

Although  computational  units  often  contain  synchronization  points,  and  there- 
fore do  not  meet  conditions  (2.2),  sub-units  of  the  computational  unit  can  be  defined 
that  execute  in  parallel  between  synchronization  points.  If  conditions  (2.2)  hold  for 
sub-units  that  execute  in  parallel  then  all  data  accessed  by  the  sub-units  can  be 
cached  during  the  execution  of  the  sub-units.  When  a  synchronization  point  is 
encountered,  central  memory  must  reflect  the  last  value  stored  into  any  shared  loca- 
tion that  will  be  accessed  by  a  sub-unit  of  another  computational  unit  after  the  syn- 
chronization point.  If  the  cacheable  attribute  of  a  shared  datum  is  changed  for  the 
execution  of  the  next  sub-unit,  the  datum  must  be  evicted  from  the  cache. 


⁴ Similar restrictions are necessary for pipelined data. If the underlying storage class of a datum is
shared, any outstanding request for that datum must be acknowledged prior to accessing a synchronization
variable. If a context switch occurs, all outstanding requests (private and shared) must be acknowledged
before the process can be removed from the PE.


The  number  of  computational  units  that  can  be  executed  in  parallel  can  fre- 
quently be  increased  by  applying  compile-time  optimization  transformations.  These 
transformations include node splitting (Allen and Cocke [72] and Kuck et al. [81])
and  array  alignment  (Cytron  [85]).  For  units  that  contain  procedure  calls,  interpro- 
cedural  analysis  is  required  to  ensure  that  parallel  execution  is  safe  (Burke  [84]  and 
Burke  and  Cytron  [85]).  Compile-time  techniques  such  as  loop  fusion  (Allen  and 
Cocke  [72])  can  be  used  to  reduce  the  number  of  synchronization  points  between 
units. 

Not  all  computational  units  that  can  be  executed  in  parallel  meet  condition  (2.2) 
or  can  be  reduced  to  sub-units  that  meet  condition  (2.2).  Consider  example  2.3:  the 
two  units  can  be  executed  in  parallel  and  the  result  of  "ok"  is  guaranteed  provided 
the  requirements  of  uniprocessor  order,  consistency,  and  program  order  are  met. 
Although  not  all  units  that  can  be  executed  in  parallel  meet  conditions  (2.2),  in  prac- 
tical applications,  most  units  have  a  regular  structure  with  simple  synchronization, 
and  do  meet  the  conditions  (2.2). 
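
The claim about example 2.3 can be checked by brute force over all interleavings that preserve the order of statements within each unit. The sketch below is illustrative only: it reads the final test of the example as Y ≥ X, models each assignment as an atomic step, and uses hypothetical tuple encodings of the statements.

```python
from itertools import combinations


def interleavings(c1, c2):
    """All merges of c1 and c2 that keep each unit's own statement order."""
    n = len(c1) + len(c2)
    for slots in combinations(range(n), len(c1)):
        it1, it2 = iter(c1), iter(c2)
        yield [next(it1) if pos in slots else next(it2) for pos in range(n)]


def results(c1, c2):
    """Set of final (X, Y) pairs over all interleavings."""
    seen = set()
    for order in interleavings(c1, c2):
        v = {"A": 0, "B": 0, "X": 0, "Y": 0}
        for dst, src in order:
            # src is either a constant or the name of a variable to copy.
            v[dst] = v[src] if isinstance(src, str) else src
        seen.add((v["X"], v["Y"]))
    return seen


C1 = [("A", 1), ("B", 1)]
C2 = [("X", "B"), ("Y", "A")]
```

Every interleaving of C1 and C2 ends with Y ≥ X. If the two stores of C1 are serviced in the reverse order, the outcome X = 1, Y = 0 becomes possible, which is why program order is required.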


Assume A and B are initially zero.

        C1                  C2

        A := 1              X := B
        B := 1              Y := A
                            if Y ≥ X then "ok"

Example 2.3


CHAPTER   3 


CACHES  IN  A  PARALLEL  SYSTEM 


A  private  cache  associated  with  each  PE  in  a  parallel  system  has  first  and 
second  order  effects  on  system  performance.  The  first  order  effect  pertains  to  the 
miss  ratio  of  an  individual  cache  and  the  effect  of  the  miss  ratio  on  the  associated 
PE's utilization. The second order effect pertains to the level of network traffic
produced  by  a  PE-cache  pair  and  the  effect  of  this  traffic  on  the  utilization  of  all  the 
PEs.  If  a  PE-cache  pair  produces  a  large  amount  of  network  traffic,  requests  from 
all  PE-cache  pairs  may  be  delayed  in  the  network,  thus  reducing  the  utilization  of 
those  PEs.  In  designing  a  cache  for  a  parallel  system  both  of  these  effects  must  be 
considered  and  tradeoffs  may  be  necessary.  Due  to  the  large  central  memory  access 
time,  cache  policies  that  minimize  the  miss  ratio  are  desirable;  however,  a  policy 
providing  a  higher  miss  ratio  may  be  used  since  the  increase  in  the  miss  ratio  may 
be  outweighed  by  a  reduction  in  traffic  intensity. 

The  following  section  describes  the  general  organization  of  a  cache  for  a  paral- 
lel system.  In  subsequent  sections  various  design  parameters  (such  as  memory 
update  policies)  are  discussed  in  terms  of  their  effects  on  the  cache  miss  ratio  and 
traffic intensity. Modifications to the basic organization to support particular policies
are also presented.

3.1.   General  Cache  Organization

Since the use of a connection network increases the average memory access
time, the cache should provide a low miss ratio and produce a minimal amount of
network  traffic;  the  cache  should  also  minimize  the  effects  of  a  cache  miss  on  subse- 
quent PE  requests.  In  uniprocessor  systems  a  cache  miss  prevents  the  cache  from 
accepting  new  processor  requests  until  the  miss  is  satisfied  --  the  cache  is  in  a 
lockup  state.  To  minimize  the  lockup  effect  sophisticated  cache  designs  have  been 
proposed  that  permit  the  existence  of  outstanding  misses  while  normal  cache  opera- 
tions continue  -  such  cache  designs  are  termed  lockup-free  (Kroft  [81];  see  also 
Brantley et al. [85] for a multiprocessor extension). The lockup-free attribute of
the  cache  permits  the  PE  to  prefetch  instructions  and  prefetch  data  without  the  delay 
penalty  of  a  cache  miss.  Thus,  for  PEs  having  prefetch  capabilities  and  having  the 
ability  to  have  several  outstanding  requests  —  such  as  the  IBM  801  (Radin  [82])  — 
the  PE  idle  time  due  to  cache  misses  can  be  reduced.  The  lockup-free  attribute  also 
allows  the  PE  to  continue  execution  after  a  store  miss  regardless  of  the  central 
memory  update  policy.  If  the  PE  issues  a  nonprefetch  load  request  and  a  miss 
occurs  the  PE  must  wait  for  the  datum  to  be  fetched  from  central  memory. 

Special  instructions  that  prefetch  data  into  the  cache  can  also  be  supported  by  a 
lockup-free cache.¹ A prefetch-into-cache instruction differs from a normal data prefetch
in that no PE register receives the result of the prefetch; rather, the data is
loaded  only  into  the  cache.    A  subsequent  instruction  would  be  issued  to  load  the 


¹ Indiscriminate use of a cache prefetch may reduce performance since a prefetched line may cause
an actively used line to be evicted.

data into a PE register. Such an instruction could be used as follows: a PE could
issue  a  prefetch-into-cache  instruction  for  the  next  iteration  of  a  DOALL  while  exe- 
cuting the  present  iteration,  thus  overlapping  computation  with  data  fetch.  Once 
the  current  iteration  is  complete  the  PE  issues  a  load  instruction  to  load  the  new 
iteration into a register. For the remainder of this chapter a PE is assumed to have
the ability to issue instruction prefetches, data prefetches, and prefetch-into-cache
instructions. It is also assumed that a PE can have several outstanding requests simultaneously.

Figure 3.1.   Basic Cache Organization

Figure 3.1 illustrates the general data flow organization of the lockup-free
cache.² The tag store and data store are functionally equivalent to the stores in a
uniprocessor  cache.  The  minimal  amount  of  bookkeeping  information  in  the  tag 
store  is  a  resident  bit  and  address  tag  for  each  cache  line.  Other  bookkeeping  infor- 
mation needed  for  central  memory  update  and  replacement  algorithms  is  discussed 
in  subsequent  sections. 

The  outstanding  request  buffer  (ORB)  is  a  small  associative  buffer  containing 
the addresses of outstanding PE requests. The miss tag buffer (MTB) is a small
associative buffer containing the line addresses of outstanding misses. The contents
of an MTB entry are shown in figure 3.2. The V bit indicates the validity of an entry.
Line Address is used to check the residency of a cache line address in the MTB. R1,
R2, ..., Rn are bits that indicate the residency of the data comprising a cache line.
The  number  of  residency  bits  is  dependent  on  the  central  memory  interleaving 
granularity  (see  section  3.4)  and  the  update  policy  (see  section  3.5).  The  miss  data 
buffer    (MDB)  is  a  buffer  capable  of  holding  a  single  line  or  each  outstanding  line 


+---+--------------+----+----+-----+----+
| V | Line Address | R1 | R2 | ... | Rn |
+---+--------------+----+----+-----+----+

Figure 3.2.   Miss Tag Buffer Entry


² Control lines are not shown.

depending on the central memory interleaving granularity (see section 3.4).

When a PE issues an instruction fetch, data load, instruction prefetch, or data
prefetch  and  a  cache  miss  occurs  (i.e.  the  address  is  not  in  the  cache  tag  store),  the 
MTB  is  examined  for  the  residency  of  the  requested  datum.  If  the  datum  is  resident 
in the MTB, it is returned to the PE from the MDB. If the datum is not resident,
the  requested  address  is  added  to  the  ORB  and  if  the  line  address  is  not  already 
resident  in  the  MTB,  it  is  added  to  the  MTB  and  a  sequence  of  central  memory 
requests  is  issued  to  fetch  the  line.  If  the  PE  issues  a  prefetch-into-cache  and  a 
cache  miss  occurs,  the  MTB  is  examined  to  see  if  the  line  has  been  fetched  (i.e.  if 
the  line  address  is  resident  in  the  MTB);  if  it  has  not  been  fetched,  the  address  is 
added  to  the  MTB  and  a  fetch  is  initiated.  Regardless  of  the  residency  of  the  line  in 
the tag store or the MTB, completion of the prefetch-into-cache is acknowledged
immediately.  If  the  PE  issues  a  store  the  action  taken  is  dependent  on  the  update 
policy  (see  section  3.5  for  details).  Completion  of  the  store  is  acknowledged 
immediately. 

As  portions  of  lines  return  from  central  memory,  they  are  assembled  in  the 
MDB  and  the  appropriate  residency  bits  are  set  in  the  MTB.  Also  the  addresses  of 
returning  line  portions  are  compared  with  the  entries  of  the  ORB.  If  an  address 
match  occurs,  the  corresponding  data  is  sent  to  the  PE.  Once  an  entire  line  is 
assembled  in  the  MDB,  a  position  for  the  line  is  allocated  in  the  tag  and  data  stores 
according  to  the  mapping  and  replacement  algorithms.  The  line  address  and  data 
are then transferred to the respective stores and the entry in the MTB is invalidated.

The MTB and MDB are separate physical units from the tag and data stores
because  the  division  allows  concurrent  operation  of  the  two  sets  of  hardware;  PE 
requests  can  be  serviced  while  a  line  is  assembled  in  the  MDB.  The  MTB  and  the 
MDB  can  be  eliminated  by  including  the  information  maintained  in  the  MTB  entries 
in  the  tag  store  entries.  Combining  the  MTB  and  the  tag  store  reduces  the  con- 
currency in  the  design,  increases  the  total  number  of  directory  bits,  and  introduces 
eviction  problems  (discussed  below).  Figure  3.3  shows  the  contents  of  a  ''com- 
bined" tag  store  entry.  The  Tag  field  is  the  same  as  in  a  uniprocessor  cache.  The 
UR  field  contains  update  policy  and  replacement  policy  information.  The  Filling  bit 
indicates  that  the  entry  is  valid  and  that  the  cache  line  has  been  fetched.  R\,  Rj,  ■■■, 
Rn  are  residency  bits  as  described  above.  Problems  with  eviction  occur  when  the 
entry  in  the  tag  store  to  be  evicted  ^  not  fully  resident  (i.e.  one  of  the  residency  bits 
is  not  set).  If  this  occurs  the  cache  must  lock  up  and  wait  for  the  line  to  become 
fully  resident.  (This  problem  does  not  arise  when  the  tag  store  and  MTB  are 
separate  since  a  line  is  not  transferred  from  the  MTB  to  the  tag  store  until  it  is  fully 
resident.  However,  the  MTB  may  become  full,  at  which  point  the  cache  must  lock 
up.) 


[Fields, left to right: Tag | UR | Filling | R1 | R2 | ... | Rn]

Figure 3.3. "Combined" Tag Store Entry

3.2. Mapping Method

The  address  mapping  method  used  in  a  cache  affects  the  miss  ratio  and  traffic 
intensity  in  the  same  way  (cache  line  conflicts  increase  both  miss  ratio  and  traffic 
intensity).  Simulations  of  mapping  methods  for  uniprocessor  caches  show  that  set- 
associative  mapping  with  a  set  size  of  two  or  four  performs  close  to  the  optimal 
method  of  associative  mapping.  In  a  parallel  system  larger  set  sizes  are  likely  to  be 
necessary  since  a  process  may  inherit  data  areas  from  parent  processes  and  these 
data  areas  may  be  dispersed  throughout  central  memory.  If  virtual  addresses  are 
cached  a  set  size  of  four  may  be  sufficient  since  the  dispersed  physical  addresses  can 
be  mapped  from  consecutive  virtual  segments. 

3.3.  Line  Size 

The cache line size is a parameter that strongly affects system performance.
The predominant measure of this effect is the cache miss ratio; a secondary
measure is traffic intensity. Typically, as the line size increases the miss
ratio decreases*, improving system performance. Unless the miss ratio is at
least halved when the line size is doubled, the network traffic intensity will
increase with an increase in line size. The extra network traffic could offset
any system improvement gained by the reduction in the miss ratio. Moreover, even
if the traffic intensity remains constant, a large line size could offset
performance improvements because of conflicts for shared resources: devices on a
shared bus may be starved while a cache line transfer takes place, or a memory
may be busied for a large number of cycles. In the case of a parallel system the
shared resources are the interconnection network and the central memory.

* If the line size becomes too large the miss ratio will begin to increase due
to data pollution (see Smith [82]).

Due  to  this  conflicting  behavior  a  less  than  optimal  line  size  (with  respect  to 
miss  ratio)  may  be  used  in  order  to  minimize  the  amount  of  network  interference 
generated  by  a  line  fetch.  A  small  line  size  produces  a  less  than  optimal  miss  ratio 
when  a  process  begins  execution  on  a  PE  and  the  cache  is  being  filled  (termed  a 
cold  start,  Easton  and  Fagin  [78]).  Once  the  cache  has  filled,  the  line  size  has  less 
of  an  effect  on  the  miss  ratio.  In  fact,  a  small  line  size  may  be  preferable  to  a  large 
line  size  once  the  cache  has  achieved  steady  state  since  a  small  line  size  is  less  likely 
to pollute the cache with extraneous data.
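
The halving rule stated earlier in this section can be checked numerically. The sketch below assumes that line-fetch traffic per reference is simply the miss ratio times the line size; the function name and the miss ratios are hypothetical and chosen only for illustration.

```python
def fetch_traffic(miss_ratio, line_size_words):
    """Words fetched from central memory per reference (assumed simple model)."""
    return miss_ratio * line_size_words


# Hypothetical miss ratios for a 4-word line and two candidate 8-word lines.
small = fetch_traffic(0.10, 4)        # baseline
not_halved = fetch_traffic(0.06, 8)   # miss ratio fell, but by less than half
halved = fetch_traffic(0.05, 8)       # miss ratio exactly halved

print(small, not_halved, halved)
```

Under this model the 8-word line with a miss ratio of 0.06 generates more traffic than the 4-word line even though it misses less often; traffic stays level only when the miss ratio is halved.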

In  a  parallel  system  the  penalty  of  process  initialization  can  be  amortized  by 
executing  processes  of  medium  to  large  granularity.  For  example,  a  process  can 
execute several iterations of a DOALL rather than generating one process for each
iteration; such a scheme is termed chunking. The number of iterations that a single
process  executes  can  be  determined  statically  when  the  processes  are  created  or 
dynamically  as  the  processes  execute  (using  a  fetch-and-add  operation  on  the  loop 
induction  variable).  For  a  discussion  and  analysis  of  various  chunking  schemes  see 
Kruskal  and  Weiss  [84]. 
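
Dynamic chunking with fetch-and-add can be sketched as follows. This is an illustration under stated assumptions, not the scheme analyzed by Kruskal and Weiss: threads stand in for PEs, a lock emulates the atomic fetch-and-add, and the names `FetchAndAdd` and `run_doall` are hypothetical.

```python
import threading


class FetchAndAdd:
    """Shared counter with an atomic fetch-and-add (emulated with a lock)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old


def run_doall(n_iterations, chunk, n_workers, body):
    """Run body(i) for i in range(n_iterations), handing out `chunk` iterations at a time."""
    counter = FetchAndAdd(0)   # plays the role of the loop induction variable

    def worker():
        while True:
            start = counter.fetch_and_add(chunk)   # claim the next chunk
            if start >= n_iterations:
                return
            for i in range(start, min(start + chunk, n_iterations)):
                body(i)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A larger `chunk` means fewer accesses to the shared counter and more iterations over which to amortize process start-up, at the cost of coarser load balancing.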

3.4.   Central  Memory  Interleaving  Granularity 

Central memory interleaving granularity (i.e. the number of consecutive bytes
stored in one MM) is not generally considered a cache parameter; however, it does
affect the organization of the cache and system performance. Since central memory
is composed of individual MMs, two basic interleaving methods can be used with
respect to the cache line size. The first method is to have an entire cache line
contained in one MM; thus, a single message is sent from a PNI to an MM
requesting that a line be transferred, and a single message containing the line
is returned from the MM. Using this method the MDB is required to hold only one
line: the one presently being received. Also, no residency bits are needed in the
MTB since the cache line will arrive in one continuous message. This method has
two potential problems. First, since memory is interleaved on a cache line basis,
accesses to consecutive non-cacheable data locations are serialized (the amount
of serialization being dependent on the line size). Second, large messages
transmitted through the network are likely to have a negative impact on
performance due to network conflicts. If two messages are routed to the same
switch output port, only one message can be transmitted to the next stage and the
other must wait; the amount of time the second message must wait increases with
the size of the messages being transmitted.

A second method for interleaving is to have a cache line distributed over
several (consecutive) MMs, say n; thus, n separate messages are issued from the
PNI and n separate messages return from the MMs, each message containing one
n-th of a line. This method permits memory to be interleaved on a PE-word basis,
maximizing concurrent access (to PE-words). The method also permits portions of
a line to traverse the network independently, minimizing the amount of
interference one line fetch has on another fetch. However, the method requires
several messages to be transmitted to the MMs instead of one. This is only a
small penalty since load requests are short messages, consisting of only the MM
address and the location of a datum in the MM.

The  method  also  requires  that  the  MDB  be  capable  of  holding  each  outstanding 
line.  This  is  necessary  since  portions  of  lines  may  be  interspersed  as  they  are 
received  from  the  connection  network.  Also,  n  residency  bits  are  needed  for  each 
entry in the MTB to indicate the residency of a partial line. Full residency is
indicated when all the bits are set.
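
The two interleaving methods can be compared with a small address-mapping sketch. The function names and the modulo mapping of word addresses to MMs are assumptions made for illustration; the text does not specify the mapping.

```python
def mm_for_word(word_addr, n_mms, words_per_mm_block):
    """MM holding a word when `words_per_mm_block` consecutive words share one MM."""
    return (word_addr // words_per_mm_block) % n_mms


def line_fetch_requests(line_addr, line_words, n_mms, words_per_mm_block):
    """Group the words of one cache line by MM: one request message per MM touched."""
    first = line_addr * line_words
    requests = {}
    for w in range(first, first + line_words):
        mm = mm_for_word(w, n_mms, words_per_mm_block)
        requests.setdefault(mm, []).append(w)
    return requests


# Line-granularity interleaving: the whole 4-word line lives in one MM.
by_line = line_fetch_requests(3, 4, 8, 4)
# Word-granularity interleaving: the same line is spread over 4 MMs.
by_word = line_fetch_requests(3, 4, 8, 1)
print(len(by_line), len(by_word))
```

The number of entries in the returned dictionary is the number of request messages, which in the second method is also the number of residency bits each MTB entry needs.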

Alternative methods for fetching a line, when n requests are needed to fetch a
line, issue a single line request from the PNI and generate n returning messages.
These methods require either the network to split a message into n requests or
the MMs containing a line to be bussed together. Both of these methods reduce the
amount of PE to MM traffic but increase the hardware complexity of the network
switches or MMs. Since the conflicts generated by line fetches are likely to
occur in transmission from the MM to PE, because of the larger message sizes, the
savings in PE to MM traffic are unlikely to justify the additional hardware
complexity.

3.5.   Central  Memory  Update  Policy 

The cache update policy has a larger effect on the traffic intensity than on the
miss ratio. Since minimizing network traffic is a major objective of the cache
design, a store-in policy intuitively appears to be a better choice than
store-through (since the miss ratio of store-in asymptotically approaches zero as
the cache size increases). However, a cache flush† may flood the network with
requests and greatly reduce system performance until the flush terminates. Thus,
the level of network traffic produced when a process is initiated, while it
executes, and when it terminates execution must be examined before an update
policy can be chosen.

The major advantage of store-through in uniprocessor systems is that central
memory reflects the most recent PE store. When the execution of a process
terminates on a PE (because of preemption, blockage, or termination) central
memory is coherent with the cache (except for the store requests that are
traversing the network from the PEs to the MMs). Store-through also has the
advantage of requiring a minimal amount of bookkeeping: only a residency bit and
the address tag for each line (replacement policy information may also be
necessary). The number of residency bits needed in the MTB depends only on the
central memory interleaving amount. The major disadvantage of store-through is
that the minimal traffic intensity equals the frequency of stores. If the
frequency of stores is low and the stores are evenly distributed in the reference
pattern of the process, the debilitating effects of store-through on the network
may be minimal.

A no-store-allocate policy is generally used with store-through, i.e. a line is
fetched only on a load miss. This policy does not provide a minimal miss ratio
since a store followed by a load causes a cache miss (if the line is not resident
at the time of the store). Store-then-load is a typical sequence for stack
references. To reduce the miss ratio a store-allocate-non-fetch (SANF) policy can
be employed, i.e. space for a line is allocated in the cache and the store is
reflected in the cache, but a fetch for the (remainder of the) line is not
generated. Subsequent loads to a modified datum can be satisfied by the cache
regardless of the residency of the rest of the line. The policy requires that
residency bits be maintained in each of the tag store entries‡ for the smallest
addressable unit. A store-through policy with SANF may be acceptable in
situations where store-through with no-store-allocate is not, since the level of
traffic generated by stores may be offset by the reduction in the number of load
misses. Also, a cache miss has a direct effect on PE performance while traffic
intensity is a second order effect.

† A cache flush operation copies all the modified data in the cache back to
central memory.

A  store  buffer  can  be  used  with  store-through  (using  no-store-allocate)  to 
reduce  the  effects  of  store-then-load  and  reduce  memory  traffic.  The  buffer,  being 
maintained  in  a  FIFO  fashion,  contains  the  most  recent  stores.  Stores  are  added  to 
the  buffer  and  are  issued  to  central  memory  only  when  the  buffer  becomes  full  and 
a  vacant  position  must  be  created.  If  an  address  is  already  resident  in  the  buffer, 
the  buffer  is  updated  to  reflect  the  second  store.  If  a  load  miss  occurs,  the  store 
buffer  is  examined.  If  the  address  is  in  the  buffer  the  data  is  returned  to  the  PE 
(dirty  bits  must  be  maintained  for  the  data  in  the  buffer  to  ensure  that  the  PE 
receives the correct data). This scheme has the advantage of requiring less
bookkeeping information than a SANF policy; however, it is more complex.
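
The store buffer behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the class name `StoreBuffer` and the `write_to_memory` callback are hypothetical, and each entry holds one whole addressable unit, so the per-datum dirty bits mentioned in the text are not modeled.

```python
from collections import OrderedDict


class StoreBuffer:
    """FIFO buffer of recent stores placed in front of central memory (sketch)."""

    def __init__(self, capacity, write_to_memory):
        self.capacity = capacity
        self.write_to_memory = write_to_memory   # callback: (address, value) -> None
        self.entries = OrderedDict()             # address -> value, oldest first

    def store(self, address, value):
        if address in self.entries:
            self.entries[address] = value        # second store updates the entry in place
            return
        if len(self.entries) == self.capacity:
            old_addr, old_val = self.entries.popitem(last=False)  # evict the oldest store
            self.write_to_memory(old_addr, old_val)
        self.entries[address] = value

    def load_miss(self, address):
        """On a cache load miss, return the buffered value if present, else None."""
        return self.entries.get(address)
```

Repeated stores to the same address reach central memory once, when the entry is finally evicted, which is where the reduction in memory traffic comes from.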

The major advantage of store-in* is that the amount of network traffic produced
while a process is being executed is a function of the cache miss ratio and line
size and not of the frequency of stores. The lockup-free attribute of the cache
minimizes the effect of store misses: when a store miss occurs the MTB issues a
fetch for the missed line, space is allocated for the line in the MDB, the line
is modified in the MDB to reflect the store, the appropriate residency bits are
set in the MTB, and the PE request is immediately acknowledged. Subsequent stores
to the same line are reflected in the MDB until the entire line is fetched from
memory. As the line is received from central memory only unmodified portions of
the line are loaded into the MDB. If the PE issues a load request for a datum in
the partially assembled line, the load is immediately acknowledged provided the
request was for a modified datum or a datum already received from central memory;
otherwise the PE waits until the datum returns from central memory.

‡ A tag entry would be similar to the one in figure 3.3, excluding the Filling bit.

To support a store-in policy a substantial amount of bookkeeping is necessary.
The number of lines maintained by the MDB must equal the maximum number of
outstanding lines regardless of the line fetch algorithm. Dirty bits are required
in the MTB for each addressable unit. Furthermore, dirty bits are required in the
tag store for each addressable unit. The tag store dirty bits are necessary since
more than one cache may have a copy of a shared cacheable line, where each PE (by
software coherence checks) is only able to modify a portion of the line. (For
example, one PE may modify vector element A(5) and another PE may modify element
A(6), where A(5) and A(6) are in the same cache line.) When central memory is
updated only the portion of the line modified by a PE can be sent to central
memory; otherwise all updates except one would be lost.
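
The need for per-unit dirty bits can be seen in a small sketch of the A(5)/A(6) example. The function `write_back` and the list representation of a line are assumptions made for illustration.

```python
def write_back(memory_line, cached_line, dirty_bits):
    """Copy only the dirty units of a cached line into the central-memory copy."""
    for i, dirty in enumerate(dirty_bits):
        if dirty:
            memory_line[i] = cached_line[i]
    return memory_line


# One 8-element line holding A(1)..A(8); index 4 is A(5) and index 5 is A(6).
memory = [0] * 8
pe0 = list(memory)
pe0[4] = 50                      # PE 0 modifies A(5) in its own cached copy
pe1 = list(memory)
pe1[5] = 60                      # PE 1 modifies A(6) in its own cached copy

write_back(memory, pe0, [i == 4 for i in range(8)])
write_back(memory, pe1, [i == 5 for i in range(8)])
print(memory)
```

Had each PE written back its entire line, the second write-back would have overwritten A(5) with the stale value held in the second PE's copy.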


* A store-allocate-fetch policy is assumed to be used with store-in unless otherwise stated.

While a process is executing, the traffic produced by store-in is low. However,
when  execution  of  a  process  terminates  central  memory  must  be  updated,  the  update 
being  achieved  by  a  flush  mechanism:  all  the  modified  data  in  the  cache  is  copied  to 
central  memory.  The  amount  of  modified  data  is  a  function  of  the  frequency  of 
stores,  program  locality,  and  the  duration  of  process  execution. 

To reduce the number of data items that must be flushed a cleanse function can
be employed. The cleanse function resets all the dirty bits over an address
range. When a process returns from a procedure call the cleanse function can be
invoked to reset the dirty bits of the old stack range. The cleanse function is
more advantageous than a release (Gottlieb et al. [83]) in such circumstances†
since the stack size does not vary rapidly or by large amounts (Ditzel and
McLellan [82]): if the old stack range is released and then another procedure is
invoked, a series of line fetches will be generated; if the old stack range is
cleansed, no fetches are generated.
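
The difference between cleanse and release can be sketched as follows. The dictionary-based cache, the `Line` class, and the function names are assumptions made for illustration; address ranges are expressed in line addresses for simplicity.

```python
class Line:
    """One cached line with a dirty bit per addressable unit (sketch)."""

    def __init__(self, n_units):
        self.dirty = [False] * n_units


def cleanse(cache, lo, hi):
    """Reset dirty bits for lines with address in [lo, hi); the lines stay resident."""
    for addr, line in cache.items():
        if lo <= addr < hi:
            line.dirty = [False] * len(line.dirty)


def release(cache, lo, hi):
    """Invalidate (remove) every line with address in [lo, hi)."""
    for addr in [a for a in cache if lo <= a < hi]:
        del cache[addr]


def flush_size(cache):
    """Number of dirty units a flush would have to copy back to central memory."""
    return sum(sum(line.dirty) for line in cache.values())
```

After a cleanse of the old stack range the lines are still resident, so the next procedure call can reuse them without fetching; after a release the same call would miss on every line.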

A  store-allocate-fetch  policy  is  generally  used  with  store-in  (and  has  been 
assumed  in  the  foregoing  discussion),  i.e.  a  line  is  fetched  whenever  a  store  miss 
occurs. However, as with store-through, a SANF policy can be employed with
store-in.  A  cache  with  such  an  organization  would  not  produce  any  network  traffic 
from  store  requests  (except  during  a  flush  or  eviction). 


† A release invalidates all lines in cache over an address range. The cleanse function is not a
replacement for a release; release is still necessary to invalidate data in the cache.

3.5.1. Characterization of Private and Shared Cacheable Stores

To determine which update policy will provide the best performance, the
characteristics of stores must be known. This section presents a characterization
of private and shared cacheable stores and describes how these stores affect the
update policies.

The  frequency  of  private  stores  is  likely  to  be  much  higher  than  the  frequency 
of  shared  cacheable  stores  since  many  private  stores  are  generated  by  stack  pushes 
(e.g. holding temporaries on the stack and using the stack for procedure
invocations). Since many private stores are stack pushes the stores are typically clustered,
especially  during  procedure  invocation.  Private  data  is  also  likely  to  have  a  high 
degree  of  temporal  locality  since  stack  space  is  re-used. 

The frequency of shared cacheable stores is likely to be low relative to other
request types, and these stores are likely to be distributed evenly over other
requests. Consider a DOALL structure where each process is computing a portion of
an aggregate result. Many load references to shared cacheable data may occur, but
the number of shared cacheable stores for a given loop iteration is small, often
just the result. If chunking is used and/or if the DOALL structure contains a
serial loop the total number of stores may be large, but the frequency of shared
cacheable stores remains low relative to other request types.

During process execution store-through is likely to produce an acceptable
amount of network traffic for shared cacheable data, but an unacceptable amount
for private data. Since the frequency of shared cacheable stores is low, the use
of a store-through policy will not generate a substantial amount of network
interference for shared cacheable stores. However, since stores to private data
are clustered, a store-through policy is likely to cause a significant amount of
network interference for private stores.

Conversely, a store-in policy is likely to be acceptable for private stores, but
unacceptable for shared cacheable stores. Since private stores have a high degree
of temporal locality, a store-in policy will minimize the amount of network
traffic generated for private stores. However, shared cacheable data may not have
good locality and a store-in policy with store-allocate-fetch may pollute the
cache with extraneous data. Moreover, at process termination the shared cacheable
data must be flushed from the cache, and the amount of data that must be flushed
may be substantial. The use of a SANF policy with store-in could reduce the
amount of traffic generated by stores during process execution, but the SANF
policy would not reduce the amount of data that needs to be flushed from the
cache at process termination.

Since no single policy appears to provide satisfactory performance, a "combined"
policy can be used. The combined policy would consist of a store-in policy for
private data and a store-through policy for shared cacheable data. Such a policy
provides the advantages of both store-in and store-through: it minimizes the
amount of traffic for the most frequent type of stores (i.e. private stores) and,
by eliminating the need for a cache flush of shared cacheable data, minimizes the
amount of data that must be flushed at process termination.


CHAPTER  4 
PERFORMANCE  ANALYSIS 

Two fundamentally different but complementary approaches can be employed
in evaluating the performance of a cache system: analytic models and simulation
models. Analytic models use a number of mathematical techniques, predominantly
from queuing theory, to form relationships between design parameters and
performance criteria. Simplifying assumptions often must be made to attain
tractable equations. The choice of such simplifications is delicate;
inappropriate simplifications may lead to a model that no longer adequately
reflects the system being modeled, or the simplifications may not provide a
tractable model.

Simulations  involve  conducting  experiments  that  approximate  the  dynamic 
behavior  of  a  system  being  modeled.  Simulation  is  more  versatile  than  analytic 
models  since  simulation  models  can  be  constructed  to  low  levels  of  detail,  provide 
more  statistical  information  and  provide  the  means  to  study  transient  behavior  in  the 
system  being  modeled  (Heidelberger  and  Lavenberg  [84]).  Simulation  models  may 
either  be  trace-driven  or  self-driven.  Trace-driven  models  use  data  obtained  from 
operational  systems  while  self-driven  models  use  probabilistic  data.  Difficulties 
arise  in  simulating  systems  not  yet  constructed.  The  model  used  may  not  accurately 
reflect  the  system  once  it  is  constructed,  and  the  data  chosen  to  drive  the  simulations 
may  not  accurately  reflect  the  workload  performed  by  the  system. 



The  performance  analysis  for  the  cache  designs  presented  in  this  thesis  will  be 
done  using  both  analytic  and  simulation  models.  Since  the  system  being  modeled  is 
complex  and  not  yet  constructed,  both  types  of  modeling  are  necessary  in  order  to 
obtain  meaningful  results. 

4.1.   Analytical  Model 

Presented in this section are approximate analytic models for evaluating the
cache parameters and policies presented in chapter 3 in terms of average waiting
times in queues and PE utilization. The models developed in this section are used
to show general characteristics of the system. They are not intended to predict
waiting times and utilizations accurately, but are used to study the behavior of
system components as parameter values and policies are altered. Simulators,
described in section 4.2, are used to examine the performance of the system more
accurately.

The modeling technique presented in this section for approximating the
utilization of the PEs in a parallel system is based on work by Patel [82]. Patel
developed a model to study the performance of private caches in multiprocessor
systems that employ a circuit switched connection network. The model predicts PE
utilization in terms of network transit time and the waiting time needed to
obtain a path to the desired MM. In the study Patel did not consider the cache
coherence problem. In a more recent study Papamarcos and Patel [84] use the same
modeling technique to study the performance of caches in a multiprocessor system
using a shared bus. In this study the effects of maintaining cache coherence by
hardware are included in the model.

Briggs and Dubois [83] have also modeled the effects of private caches in a
multiprocessor system. In their system separate memories are used for cacheable
and noncacheable (shared) data. The shared memory is composed of a
two-dimensional memory organization called an L-M memory organization (Briggs and
Davidson [77]). The L-M organization is used to reduce the total access time to
fetch a cache line. The network used in the model is a crossbar.

The  model  presented  in  this  thesis  differs  from  the  models  mentioned  above  in 
that  a  packet  switched  network  is  used,  a  single  data  space  contains  both  cacheable 
and  noncacheable  data,  and  portions  of  the  same  cache  line  are  permitted  to  reside 
in  several  MMs. 

A queuing network model for the parallel computer described in chapter 2 is
presented in figure 4.1. A PE-cache pair is the source and sink of memory
requests. Only one PE-cache pair is modeled since each pair is assumed to have
the same request rate. The effects of other PEs on network traffic are modeled
indirectly in the queuing delays of the network switches and the MMs. Only one
MNI (in and out) is modeled since requests to shared memory are assumed to be
uniformly distributed over all of the MMs. The rate at which requests are
generated by the source is a function of the request frequency for each storage
class, the cache miss ratio, the cache-memory update policy, and the utilization
of the PE. The request frequencies and cache miss ratios are parameters to the
analysis.

[Figure 4.1. Queueing network model. Recoverable labels: PE, Cache, PNI, Network]

The utilization of a PE is determined as follows. Each PE can be in one of two
states: performing useful computations and issuing memory requests, or waiting
for a memory request to return from an MM. For $C$ units of computation the
processor may issue $C r_w$ requests for which it has to wait for a response,
where $r_w$ is the average frequency of requests per computational unit not
satisfied by the cache. Let $t$ be the average transit time for a single request
(i.e. the time required to transmit a request to an MM, service the request, and
transmit the response back to the PE). Then, for $C$ units of useful computation
$C + C t r_w$ units are needed; thus the utilization of a PE is:

$$U = \frac{C}{C + C t r_w} = \frac{1}{1 + t r_w} \tag{4.1}$$

Before calculating $r_w$, the following parameters are defined. Let $p$, $s$,
and $sc$ be subscripts denoting private, shared non-cacheable, and shared
cacheable, respectively. Let $I$, $L$, $S$, and $F$ denote the average request
frequencies for instructions, loads*, stores, and fetch-and-phi's†,
respectively. Some notational examples are as follows: $I$ is the average PE
request frequency for instructions; $L_{sc}$ is the average request frequency for
shared cacheable loads; and $S_p$ is the average request frequency for private
stores. $I$ and $F$ are not subscripted since instructions are always cacheable
and fetch-and-phi's are assumed to be non-cacheable. The cache miss ratios are
denoted as follows: $f^I$ is the instruction miss ratio, $f_p^L$ and $f_p^S$ are
the private data load and store miss ratios, and $f_{sc}^L$ and $f_{sc}^S$ are
the shared cacheable load and store miss ratios.‡ Lastly, $\Omega$ is defined to
indicate the memory update policy and $\Gamma$ is defined to indicate whether a
fetch or non-fetch policy is to be used with store-in:

$$\Omega_x = \begin{cases} 1 & \text{if store-in} \\ 0 & \text{if store-through} \end{cases} \tag{4.2a}$$

$$\Gamma_x = \begin{cases} 1 & \text{if fetch} \\ 0 & \text{if non-fetch} \end{cases} \tag{4.2b}$$

where $x$ is $p$ or $sc$. A store-allocate policy is assumed to be used with
store-in; $\Gamma$ indicates whether a line fetch is generated on a store miss.
The effect of a no-store-allocate or store-allocate-non-fetch policy used with
store-through is reflected in the load miss ratios; therefore a parameter that
specifies which of the policies is used with store-through is not necessary.

The following are the average request frequencies (in PE cycles) that are not
satisfied by the cache:

$$r^I = I f^I \tag{4.3a}$$

$$r_p^L = L_p f_p^L \tag{4.3b}$$

$$r_{sc}^L = L_{sc} f_{sc}^L \tag{4.3c}$$

$$r_s{}^{\S} = L_s + S_s + F \tag{4.3d}$$

Thus, $r_w$ is

$$r_w = r^I + r_p^L + r_{sc}^L + r_s \tag{4.4}$$

* For the remainder of the thesis loads refer to data loads, unless otherwise stated.

† Fetch-and-phi is a read-modify-write.

‡ Separate miss ratios are used for instruction fetches and data loads and stores
because the amount of network traffic generated by a particular request type
varies depending on the update policy.


The variable $r_w$ represents the frequency of requests the PE must await before
continuing useful computation. Note that the cache line size is only represented
in the cache miss ratios and not directly in the request frequency since the PE
only waits for a request that generates an instruction or load miss. Private and
shared cacheable stores also have no effect on this request frequency regardless
of the update policy (when a store-through update policy is used the processor
does not have to wait for the completion of the store, and since the cache is
lockup-free, when a store-in policy is used updates are made in the cache
regardless of a line's residency; see section 3.5).* However, it will be shown
subsequently that cache line size and cacheable stores do have a noticeable
effect on the utilization of the PEs.
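
The accounting above can be turned into a short calculation. The sketch below is an illustration under stated assumptions: the function and parameter names (including `r_w` for the rate of requests the PE waits on) are chosen for the example, utilization is taken to be useful time over total time, C / (C + C t r_w), and the numeric values are hypothetical.

```python
def wait_request_rate(I, L_p, L_sc, L_s, S_s, F, f_I, f_Lp, f_Lsc):
    """Rate of requests per computational unit that the PE must wait on."""
    r_I = I * f_I            # instruction misses
    r_Lp = L_p * f_Lp        # private load misses
    r_Lsc = L_sc * f_Lsc     # shared cacheable load misses
    r_s = L_s + S_s + F      # shared non-cacheable requests always go to memory
    return r_I + r_Lp + r_Lsc + r_s


def pe_utilization(r_w, t):
    """Fraction of time spent computing: C / (C + C * t * r_w)."""
    return 1.0 / (1.0 + t * r_w)


# Hypothetical workload: one instruction fetch per unit, 1% instruction miss ratio, etc.
r_w = wait_request_rate(I=1.0, L_p=0.2, L_sc=0.05, L_s=0.01, S_s=0.005, F=0.005,
                        f_I=0.01, f_Lp=0.05, f_Lsc=0.1)
print(r_w, pe_utilization(r_w, t=20))
```

Store frequencies for private and shared cacheable data do not appear as inputs, which mirrors the observation that the PE does not wait on those stores.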

To compute the average transit time, $t$, in eq. (4.1) the average waiting time
of the queues depicted in figure 4.1 must be found. The following sections
present models that approximate these average waiting times. In these models
requests are assumed to be uniformly distributed over all the MMs and a request
is assumed to be independent of all other requests.† System queues are assumed to
be unbounded and able to accept any number of arrivals in a single cycle. A cycle
(network cycle) is defined as the time to transmit a packet (unit of interswitch
data transfer) between two switches. A single request transmitted through the
network is called a message. A message is composed of one or more packets.

§ $r_s$ is conservative since a PE can continue issuing references to private
data while a single reference to shared data is outstanding.

* If the cache locked up on store misses, denoted with the superscript $lu$, the
frequency of store misses must be added to $r_w$; thus
$r_w^{lu} = r^I + r_p^L + r_{sc}^L + r_s + \Omega_p \Gamma_p S_p f_p^S + \Omega_{sc} \Gamma_{sc} S_{sc} f_{sc}^S$.

To find an approximate model for the average waiting times of the queues depicted in figure 4.1, a common recurrence relation and derivation are used to obtain the expected number of customers in a queue and the average system time. Since the recurrence and derivation are used in all the queuing models, both are presented here.

Let the random variable $v_n$ be the number of arrivals at a queue during service cycle $n$, and let the number of arrivals be independent of the arrivals during any other service cycle. Let $q_n$ be the number of requests in the queue at the end of service cycle $n$. The following recurrence relation can be written for $q_{n+1}$:

$$q_{n+1} = \begin{cases} q_n - 1 + v_{n+1}, & q_n > 0 \\ v_{n+1}, & q_n = 0 \end{cases} \qquad (4.5)$$

The minus one represents the departure of a request from the queue (which can only occur if the queue is nonempty) between service cycles $n$ and $n+1$.

^6 This last assumption is clearly an approximation, since the multiple requests needed to fetch a single cache line are not independent. The assumption is made to simplify the analysis.

This recurrence is "formally identical" to eq. (5.33) of Kleinrock [75], which describes the number of customers at departure times in an M/G/1 queuing system. Therefore, a derivation that leads to the Pollaczek-Khinchin mean-value formula is used to obtain the expected number of requests in a queue:^7

$$\bar{q} = \frac{V(v)}{2(1 - E(v))} + \frac{E(v)}{2}$$

where $E$ is the expectation and $V$ is the variance. By Little's formula, $\bar{q} = T \cdot E(v)$, the average system time is found:

$$T = \frac{V(v)}{2E(v)(1 - E(v))} + \frac{1}{2} \qquad (4.6)$$

The average waiting time for a queue, $w$, can then be found by subtracting the service time from $T$.
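The recurrence and the mean-value formula can be checked numerically. The following Python sketch is illustrative only and is not taken from the thesis's simulators: the function names, the use of binomial arrivals, and the example parameter values are all assumptions. It evaluates the expected queue length V(v)/(2(1 - E(v))) + E(v)/2 and the system time of eq. (4.6), and compares the former with a direct iteration of the recurrence, in which one request departs per service cycle whenever the queue is nonempty.

```python
import random


def mean_queue_length(ev, vv):
    """Expected number of requests in the queue (mean-value formula)."""
    return vv / (2.0 * (1.0 - ev)) + ev / 2.0


def system_time(ev, vv):
    """Average system time T of eq. (4.6), measured in service cycles."""
    return vv / (2.0 * ev * (1.0 - ev)) + 0.5


def waiting_time(ev, vv, service=1.0):
    """Waiting time: system time scaled to cycles, minus the service time."""
    return system_time(ev, vv) * service - service


def simulate_mean_queue(trials, prob, cycles, seed=1):
    """Iterate the recurrence with binomial(trials, prob) arrivals per cycle."""
    rng = random.Random(seed)
    q = 0
    total = 0
    for _ in range(cycles):
        arrivals = sum(1 for _ in range(trials) if rng.random() < prob)
        q = q - (1 if q > 0 else 0) + arrivals  # departure only if nonempty
        total += q
    return total / cycles


if __name__ == "__main__":
    trials, prob = 2, 0.25  # assumed example values
    ev = trials * prob
    vv = trials * prob * (1.0 - prob)
    print("formula:   ", mean_queue_length(ev, vv))
    print("simulation:", simulate_mean_queue(trials, prob, 200_000))
```

The simulated mean only approaches the formula for long runs, since the formula describes the limiting distribution of the queue length.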

4.1.1.   PNI  queue 

A function of the PNI is to disassemble a request to shared memory into a sequence of message packets to be transmitted over the connection network; the number of packets in a message depends on the request type, word size, and packet size (which is a function of the network bandwidth). A cache miss will generate one or more messages, the number of messages depending on the cache line size and the interleaving granularity. Arrivals to the PNI queue are generated by:^8

1) a cache miss

2) a non-cacheable load

3) a non-cacheable store

4) a fetch-and-phi

5) a store, if store-through is used

^7 The derivation is a valid approximation for any arrival distribution provided the number of arrivals is independent of the number of requests in the queue and the following limit holds: $\lim_{n \to \infty} E[q_n^2] = E[q^2]$, where $q$ is the limiting distribution for the random variable $q_n$.

^8 Cache evictions are not considered in the model.

To  find  the  probability  of  an  arrival  at  the  PNI,  the  frequency  of  each  of  the  above 
must  be  found. 

The average frequency of a cache miss, $\omega$, is given by the sum of eqs. (4.3a), (4.3b), and (4.3c) plus the frequency of store misses (if store-in is used with store-allocate-fetch),^9 i.e.

CO  =  r'  +  r^  +  ric  +  Spfp^pYp  +  S^cfsc^sc^sc  (4.7a) 

The average frequency of non-cacheable loads is $L_s$, of non-cacheable stores is $S_s$, and of fetch-and-phi's is $F$. The frequency of store requests generated using a store-through policy is:

CT  =  (l-np)5p  +  (l-fi,,)5,,  (4.7b) 

Let $U_{pe}$ be the utilization of the PE and $C_{pe}$ be the number of network cycles per PE cycle. Then the probability of an arrival at the PNI in a network cycle is given by

$$p_{pni} = \frac{U_{pe}(\omega + \sigma + r_s)}{C_{pe}} \qquad (4.7)$$

The number of messages generated by a cache miss is

$$\phi = \frac{\text{line size}}{\text{interleaving granularity}} \qquad (4.8a)$$

The probability of a message leaving the PNI and arriving at the first stage of the network is


'For  a  lockup  cache  co'"  =  r-H 
$$p = \frac{U_{pe}(\phi\omega + \sigma + r_s)}{C_{pe}} \qquad (4.8)$$
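As a small worked illustration, the two arrival probabilities can be computed directly. This Python sketch assumes that eq. (4.7) reads U_pe(omega + sigma + r_s)/C_pe and that eq. (4.8) reads U_pe(phi omega + sigma + r_s)/C_pe; the function and parameter names are hypothetical, and the numeric values are made up for the example.

```python
def pni_arrival_probability(u_pe, omega, sigma, r_s, c_pe):
    """Eq. (4.7): probability of a request arriving at the PNI per network cycle."""
    return u_pe * (omega + sigma + r_s) / c_pe


def messages_per_miss(line_size, interleaving_granularity):
    """Eq. (4.8a): number of messages generated by one cache miss."""
    return line_size / interleaving_granularity


def network_arrival_probability(u_pe, omega, sigma, r_s, c_pe, phi):
    """Eq. (4.8): probability of a message entering the first network stage."""
    return u_pe * (phi * omega + sigma + r_s) / c_pe


if __name__ == "__main__":
    phi = messages_per_miss(line_size=8, interleaving_granularity=2)
    # Assumed example values: 80% PE utilization, 2 network cycles per PE cycle.
    print(pni_arrival_probability(0.8, 0.02, 0.0, 0.03, 2))
    print(network_arrival_probability(0.8, 0.02, 0.0, 0.03, 2, phi))
```

Only the cache-miss term is scaled by phi, because a line fetch is the only request that the PNI expands into several messages.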

To find the waiting time for the PNI queue, the distribution of $v$ (from eq. (4.5)) must be specified. Assume that all requests arriving at the PNI queue will be transmitted over the network as a message consisting of $m$ packets (i.e. message size is $m$); thus the service time for a request is $m$ cycles (since message size equals service time). The maximum number of arrivals to the PNI queue during a service cycle is $m$. The distribution of $v$ can be approximated by a Bernoulli (binomial) distribution of $m$ trials with a probability of success of $p_{pni}$, i.e. by $b(\cdot; m, p_{pni})$.

The above distribution assumes that arrivals could occur during every service time; however, this is not valid, since a non-prefetch read miss, a fetch-and-phi, or a non-cacheable load or store causes the PE/cache source to stop issuing requests until the outstanding request returns from the MMs (i.e. lockup). This effect can be approximated by multiplying the distribution by the probability, $q$, that the request being serviced is not one of the above. (This only approximates the effect of lockup, since the request that caused the lockup may be on the queue and not yet in service.) The distribution of $v$ can be approximated by:


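$$v \sim \begin{cases} b(\cdot; m, p_{pni}) & \text{with probability } q \\ 0 & \text{with probability } 1 - q \end{cases}$$

The expected value is $E(v) = m p_{pni} q$ and the variance is $V(v) = m p_{pni} q (1 - p_{pni}) + m^2 p_{pni}^2 q - m^2 p_{pni}^2 q^2$.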

Substituting into eq. (4.6), the average system time per service cycle is found; multiplying by the average service time, $m$, produces the average system time per network cycle, i.e.

$$T = \frac{m p_{pni}(m - mq - 1) + m}{2(1 - m p_{pni} q)} + \frac{m}{2}$$

To obtain the average waiting time, the average service time is subtracted from $T$, giving

$$w_{pni} = \frac{m^2 p_{pni}\left(1 - \frac{1}{m}\right)}{2(1 - m p_{pni} q)} \qquad (4.9)$$

The messages arriving at the PNI queue will not be of one size as assumed above, but will be of a few different sizes, say $n$. To incorporate $n$ different message sizes into the model, the distribution of $v$ is defined as a composition of $n$ Bernoulli distributions $b(\cdot; m_1, p_{pni}), \ldots, b(\cdot; m_n, p_{pni})$, where the $i$-th distribution occurs with probability $q_i$ ($q_i$ being the probability of an arrival of a message of size $m_i$). The average service time is then equal to the average message size $\bar{m}_{pni}$.

The message size of a cache line fetch is $\phi a$, where $a$ is the message size of a load request (consisting of a MM number, memory offset, and memory operation). A line fetch arrives at the PNI as a single request and the PNI generates $\phi$ messages of size $a$. The $\phi$ messages are issued before the PNI services another request. In the network the $\phi$ messages are treated as independent requests.
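The single-message-size PNI result can be verified by carrying out the substitution numerically. The sketch below is an illustration under stated assumptions, not the thesis's own code: it takes the arrival count to be binomial b(.; m, p) with probability q and zero otherwise, applies eq. (4.6), scales by the service time m, subtracts m, and compares the result with the closed form m^2 p (1 - 1/m) / (2(1 - m p q)) of eq. (4.9). All function names and example values are hypothetical.

```python
def pni_moments(m, p, q):
    """Mean and variance of v: binomial(m, p) with probability q, else zero."""
    ev = m * p * q
    second_moment = q * (m * p * (1.0 - p) + (m * p) ** 2)
    return ev, second_moment - ev ** 2


def pni_wait_by_substitution(m, p, q):
    """Apply eq. (4.6), convert service cycles to network cycles, subtract m."""
    ev, vv = pni_moments(m, p, q)
    system_time_service_cycles = vv / (2.0 * ev * (1.0 - ev)) + 0.5
    return system_time_service_cycles * m - m


def pni_wait_closed_form(m, p, q):
    """Closed form of eq. (4.9)."""
    return m * m * p * (1.0 - 1.0 / m) / (2.0 * (1.0 - m * p * q))


if __name__ == "__main__":
    m, p, q = 4, 0.05, 0.6  # assumed example values
    print(pni_wait_by_substitution(m, p, q), pni_wait_closed_form(m, p, q))
```

The two routes agree because the q-dependent terms of the variance cancel in the numerator, leaving q only in the denominator.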

4.1.2.   Switch  Output  Queue 

As described in chapter 2, a network switch has $k$ input ports and $k$ output ports, with a queue associated with each output port. The output queues are capable of accepting a packet from each of the $k$ input ports at each cycle. Kruskal and Snir [83] present an analysis of the queueing delay of the output queues; the following summarizes their results for the delay at the first stage of a buffered interconnect network. Kruskal et al. [84] present a new derivation for determining switch waiting times. This derivation permits them to derive more statistical information about the queue, such as the variance.

Let $p$ be the probability of an arrival at an input port and let an arrival be equally likely to join any one of the $k$ output queues. Let $v_n$ be the number of messages joining a fixed output queue at cycle $n$, where a message comprises only one packet. Assuming the number of arrivals is independent for each $n$, the arrival distribution is given by the Bernoulli distribution $b(\cdot; k, p/k)$. Substituting into eq. (4.6), the average system time is obtained:

$$T = \frac{1 - \frac{p}{k}}{2(1 - p)} + \frac{1}{2}$$

Subtracting the service time and simplifying, the waiting time is obtained:

$$w = \frac{p\left(1 - \frac{1}{k}\right)}{2(1 - p)} \qquad (4.10)$$

A single packet is not likely to represent a full message; rather, a message will consist of several packets, say $m$. In this case each message is said to be time multiplexed by a factor $m$, and messages are assumed to arrive every $m$-th cycle. If $t_c$ is the cycle time to process a single packet, the cycle time to process a message is $T_c = m t_c$. The average number of packets per cycle will now be $P = mp$. Substituting into eq. (4.10), the average waiting time is

$$w = \frac{m^2 p\left(1 - \frac{1}{k}\right)}{2(1 - mp)} \qquad (4.11)$$
Eq. (4.10) was verified by Kruskal and Snir by simulation. The predicted delays for the first stage were in good agreement with the simulated delays. The simulations showed that subsequent stages of the network had larger delays than the first stage. The increases are caused by the second and subsequent stages of the network not maintaining time-independent arrivals. Eq. (4.11) was also verified, and it provided good approximations for later stages in the network for $m > 1$. Though eq. (4.11) provides a good approximation for the queuing delay for a single message size, it cannot reflect the delays when various message sizes are transmitted through the network (due to the "time multiplexed" argument).

A model can be constructed which does not rely on a "time multiplexed" argument by assuming the queue server to have a service time of $m$ cycles. The number of arrivals $v_n$ during service cycle $n$ is given by the Bernoulli distribution $b(\cdot; mk, p/k)$. This distribution is similar to that of Kruskal and Snir, except that the number of trials is $mk$ instead of $k$ due to the service time of $m$. The expected value is $E(v) = mp$ and the variance is $V(v) = mp(1 - p/k)$.

Substituting into eq. (4.6) and multiplying by $m$ (since a service cycle is $m$ cycles), the system time is obtained:

$$T = \frac{m\left(1 - \frac{p}{k}\right)}{2(1 - mp)} + \frac{m}{2}$$

To obtain the average queue delay, the average service time, $m$, is subtracted from $T$; thus

$$w = \frac{m^2 p\left(1 - \frac{1}{km}\right)}{2(1 - mp)} \qquad (4.12)$$
When $n$ different message sizes are considered, the distribution of $v$ is defined as a composition of $n$ Bernoulli distributions $b(\cdot; k m_1, p/k), \ldots, b(\cdot; k m_n, p/k)$, where the $i$-th distribution occurs with probability $q_i$.

Eq. (4.12) produces waiting times that are larger than the results obtained by Kruskal and Snir, eq. (4.11). This is due to the $1/km$ term instead of $1/k$. A result identical to eq. (4.11) can be derived (without a "time multiplexed" argument): instead of considering $mk$ independent trials during a service cycle, consider only $k$ trials with the probability of an arrival being $m$ times greater, i.e. the probability of an arrival is $mp$. The probability is guaranteed to be $< 1$ since the network cannot transmit a message faster than one packet per cycle, i.e. $p$ is guaranteed to be $< 1/m$. Thus the distribution of $v$ is given by the Bernoulli distribution $b(\cdot; k, mp/k)$. Substituting into eq. (4.6), the expected system time per service cycle is found. An approximate average waiting time is then found by multiplying by $m$ and subtracting the service time. Thus the average waiting time is:

$$w = \frac{m^2 p\left(1 - \frac{1}{k}\right)}{2(1 - mp)} \qquad (4.13)$$
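The difference between the two switch models can be seen by evaluating both numerically. The Python sketch below is illustrative only (function names and example values are assumptions). It applies eq. (4.6) to a binomial arrival distribution, scales by the service time m and subtracts m, and compares the mk-trial model of eq. (4.12) with the k-trial model of eq. (4.13), using the closed forms m^2 p (1 - 1/(km)) / (2(1 - mp)) and m^2 p (1 - 1/k) / (2(1 - mp)).

```python
def wait_from_binomial(trials, prob, m):
    """Waiting time from eq. (4.6) for binomial(trials, prob) arrivals per
    service cycle, where a service cycle lasts m network cycles."""
    ev = trials * prob
    vv = trials * prob * (1.0 - prob)
    return (vv / (2.0 * ev * (1.0 - ev)) + 0.5) * m - m


def switch_wait_single_packet(p, k):
    """Eq. (4.10): first-stage waiting time for one-packet messages."""
    return p * (1.0 - 1.0 / k) / (2.0 * (1.0 - p))


def switch_wait_mk_trials(p, k, m):
    """Eq. (4.12): mk trials, each with probability p/k."""
    return m * m * p * (1.0 - 1.0 / (k * m)) / (2.0 * (1.0 - m * p))


def switch_wait_k_trials(p, k, m):
    """Eqs. (4.11) and (4.13): k trials, each with probability mp/k."""
    return m * m * p * (1.0 - 1.0 / k) / (2.0 * (1.0 - m * p))


if __name__ == "__main__":
    p, k, m = 0.1, 2, 4  # assumed example values; requires m * p < 1
    print("eq. (4.12):", switch_wait_mk_trials(p, k, m))
    print("eq. (4.13):", switch_wait_k_trials(p, k, m))
```

With m = 1 both multi-packet forms reduce to eq. (4.10), which is a quick consistency check on the closed forms.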

Note that eq. (4.13) is identical to eq. (4.11). When $n$ different message sizes are considered, the distribution of $v$ is defined as a composition of $n$ Bernoulli distributions $b(\cdot; k, m_1 p/k), \ldots, b(\cdot; k, m_n p/k)$, where the $i$-th distribution occurs with probability $q_i$.
4.1.3.  MNI_in  Queues

The MNI_in queue contains requests received from the network that are pending memory service. The MNI_in queue can be modeled in a fashion similar to the network switches. The probability of a message arriving during any cycle is given by $p$. Let $C_{mm}$ be the service time (cycle time) of a MM and let a service cycle be the time needed to complete service for a single customer. The number of arrivals, $v$, at the MNI_in queue can be approximated by the Bernoulli distribution $b(\cdot; C_{mm}, p)$. Substituting into eq. (4.6), the expected system time per service cycle is found; multiplying by the service time produces the system time per network cycle, i.e.

$$T = \frac{C_{mm}(1 - p)}{2(1 - C_{mm} p)} + \frac{C_{mm}}{2} \qquad (4.15)$$

Then the service time must be subtracted from eq. (4.15) to obtain an approximate average waiting time:

$$w_{mniin} = \frac{C_{mm}^2 p\left(1 - \frac{1}{C_{mm}}\right)}{2(1 - C_{mm} p)} \qquad (4.16)$$

4.1.4.  MNI_out  Queue

An MNI_out queue contains requests which have been serviced by a MM and which are waiting to be transmitted over the connection network. Requests enter the queue as a single message and are transmitted as a sequence of packets, just as in the PNI queue; therefore, the queue is modeled in a fashion similar to the PNI queue.

The probability, $p_{mniout}$, of an arrival at the MNI_out queue depends on the MM cycle time in terms of network cycles, $C_{mm}$, and the utilization of the MM, $p C_{mm}$; thus

$$p_{mniout} = \frac{p C_{mm}}{C_{mm}} = p$$


Let $m$ be the size of a message (in packets) arriving at the MNI_out queue. The number of arrivals, $v$, at the queue can be approximated by the Bernoulli distribution $b(\cdot; m, p)$. Eq. (4.6) is used again to obtain the expected system time. An approximate average waiting time is found by multiplying the system time per service cycle by $m$ and subtracting the service time. Thus

$$w_{mniout} = \frac{m^2 p\left(1 - \frac{1}{m}\right)}{2(1 - mp)} \qquad (4.17)$$

A few different message sizes can be accommodated by defining $v$ as a composite distribution as described in section 4.1.1.

4.1.5.   Average  Transit  Time 

The average transit time in eq. (4.1) can be defined in terms of server times and the queue waiting times presented in the previous sections. Let $K$ be the number of stages in the connection network and let $\bar{m}_o$ and $\bar{m}_i$ be the average message sizes transmitted from the PNI to a MM and from a MM to the PNI, respectively. Thus, the following average system times can be defined:

$$T_{pni} = w_{pni} + \bar{m}_o$$

$$T_{mm} = w_{mniin} + C_{mm} + w_{mniout} + \bar{m}_i$$

The average transit time is defined as

$$t = T_{pni} + 2K T_{sw} + T_{mm} \qquad (4.18)$$

In calculating $t$, the probability of a message arriving at a queue is assumed to be constant and independent in all network stages and during all network cycles (thus, the expected waiting times for all the switch output queues are assumed to be the same as the expected waiting times for the first stage). Furthermore, each queue is analysed independently of the rest of the queuing network. This latter assumption can be viewed as an application of Jackson's decomposition theorem (Jackson [63]); however, Jackson's theorem applies to queuing networks with Poisson arrivals and exponential service times. Though the assumption may not be valid, it simplifies the analysis.
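The composition of the transit time can be sketched in a few lines of Python. This is an illustration under assumptions rather than the thesis's own procedure: it assumes the memory-side system time is the MNI_in wait plus the MM cycle time plus the MNI_out wait plus the return message size, that both MNI waits take the single-size form service^2 p (1 - 1/service) / (2(1 - service p)) of eqs. (4.16) and (4.17), and that eq. (4.18) is the PNI system time plus 2K switch system times plus the memory system time. The PNI and switch system times are taken as inputs, and all names and values are hypothetical.

```python
def single_size_wait(service, p):
    """Waiting time for binomial(service, p) arrivals per service cycle,
    the form shared by eqs. (4.16) and (4.17)."""
    return service ** 2 * p * (1.0 - 1.0 / service) / (2.0 * (1.0 - service * p))


def memory_system_time(p, c_mm, m_return):
    """MNI_in wait + MM cycle time + MNI_out wait + return message size."""
    return (single_size_wait(c_mm, p) + c_mm
            + single_size_wait(m_return, p) + m_return)


def transit_time(t_pni, t_sw, t_mm, stages):
    """Eq. (4.18): a message crosses `stages` switches in each direction."""
    return t_pni + 2 * stages * t_sw + t_mm


if __name__ == "__main__":
    t_mm = memory_system_time(p=0.05, c_mm=4, m_return=2)  # assumed values
    print(transit_time(t_pni=3.0, t_sw=1.5, t_mm=t_mm, stages=6))
```

Because every queue is treated independently, the transit time is a plain sum; any correlation between stages is ignored, as the text notes.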

4.2.   Simulation  Methods 

To fully demonstrate and analyze the primary and secondary effects of cache policies on the performance of a highly parallel system by simulation requires a simulator consisting of modules that accurately represent each component of the parallel system. The time and memory requirements to run such a simulator are prohibitive. To provide a faithful simulation with a more manageable level of complexity, a two-level simulation method was adopted, reflecting the first-order and second-order effects of the cache on performance. The primary level simulates a single cache. The second level simulates a parallel machine. In both levels of simulation a PE's memory requests are generated by processing trace data (in the case of the second-level simulator a stochastic process can also be used to generate PE requests).^11 The simulators were written in PASCAL and run on a Data General MV8000. The remainder of this section discusses how the trace data is obtained and describes the two levels of simulation.

^10 In calculating the switch waiting times from the PEs to the MMs and from the MMs to the PEs, the appropriate message sizes are used.

4.2.1.  Trace  Data 

Trace data used to drive the simulators is generated by recording memory references of parallel programs simulated by WASHCLOTH (Gottlieb [80]). WASHCLOTH is an instruction-level Paracomputer (Schwartz [80]) simulator whose PEs are CDC 6600s (Thornton [70]). Programs run in the WASHCLOTH environment are standard FORTRAN programs that have been reorganized to execute in parallel and to meet the restrictions imposed by WASHCLOTH. One of these restrictions requires shared data to be declared as part of blank common. Data not declared as part of blank common is considered private. Some parallel constructs have been added to the WASHCLOTH environment by Korn [81]; the constructs are Dowait, Doall, and Enddo.^12 The constructs are implemented as subroutine calls with loop induction variables, iteration ranges, iteration steps, and a loop termination address as parameters.

As a program is simulated by WASHCLOTH, a trace routine is called which records each memory reference of a specified PE.^13 The trace data is collected on a process-by-process basis, a process being composed of one or more iterations of a Dowait (Doall) loop. A marker is placed in the trace data to indicate the beginning and end of a process's references and the termination of a loop iteration. When the trace data is used by the simulator, performance is analysed on a process basis. Performance statistics for an entire program can be generated by averaging the process-by-process analyses.

^11 Note that the simulators use reference patterns; they do not simulate PE instructions or perform operations on "actual" data.

^12 A Dowait loop contains an implicit synchronization point at loop termination; i.e. the statement logically following the Enddo is not executed until all the loop iterations are completed. The Doall does not have this synchronization point.

Information recorded about each memory reference includes: the memory address, type of reference (instruction fetch, data load, or data store), storage class (instruction, private data, or shared data), cacheability, and instruction length. The storage class of a datum is determined by the address of the memory reference; a reference is to private data if the address is below the first location of blank common. The cacheable attribute is a function of a datum's storage class and usage. All instructions and private data are considered cacheable; thus the cacheable attribute of these storage classes is determined by their addresses. Since WASHCLOTH only provides one storage area for shared variables, a simple segmentation scheme was added to provide a more flexible storage structure. If a shared segment can be cached, the beginning and ending addresses of the segment are passed to a routine called CACHABL; CACHABL maintains a list of all cacheable shared segments. The list is referenced by the trace routine to determine if a shared location is cacheable. A routine called NONCACH is provided to remove a segment from the cacheable list.
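The classification rules above can be sketched as follows. This Python fragment is a hypothetical illustration, not the original trace routine: only the routine names CACHABL and NONCACH come from the text, while the class, the method signatures, the inclusive address ranges, and the string labels for reference types are assumptions.

```python
class CacheableSegments:
    """Tracks which shared segments are currently marked cacheable."""

    def __init__(self, blank_common_start):
        # Addresses below blank common hold private data (per the text).
        self.blank_common_start = blank_common_start
        self.segments = []  # (first, last) inclusive address pairs

    def cachabl(self, first, last):
        """Mark the shared segment [first, last] as cacheable."""
        self.segments.append((first, last))

    def noncach(self, first, last):
        """Remove a previously registered segment from the cacheable list."""
        if (first, last) in self.segments:
            self.segments.remove((first, last))

    def storage_class(self, address, ref_type):
        """Classify a reference as instruction, private, or shared."""
        if ref_type == "ifetch":
            return "instruction"
        if address < self.blank_common_start:
            return "private"
        return "shared"

    def is_cacheable(self, address, ref_type):
        """Instructions and private data are always cacheable; shared data
        is cacheable only inside a registered segment."""
        if self.storage_class(address, ref_type) != "shared":
            return True
        return any(first <= address <= last for first, last in self.segments)


if __name__ == "__main__":
    segs = CacheableSegments(blank_common_start=1000)  # assumed layout
    segs.cachabl(1200, 1299)
    print(segs.is_cacheable(1250, "load"), segs.is_cacheable(1300, "store"))
```

A linear scan of the segment list is adequate for a sketch; a tracer handling many segments might keep them sorted instead.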


^13 Only one PE is traced since the reference patterns of all the PEs will be similar.
Since WASHCLOTH simulates a paracomputer whose PEs are CDC 6600s, and since such PEs are not likely to be used in a parallel computer, the effects of the 6600 architecture on cache performance must be considered. First, the word size of the 6600 is large, 60 bits; thus fewer locations of independent variables can be stored in a given cache size. Moreover, the 6600 is only word addressable; most contemporary machines have many levels of addressability. For non-scientific applications the large word size of the 6600 could cause a higher miss ratio for a fixed cache size since a greater amount of memory (i.e. number of bits) may be needed to represent particular data structures.

Second, a PE-cache bus on the 6600 would be 60 bits wide whereas a PE-cache bus on a contemporary architecture is likely to be 32 bits. The wide bus of the 6600 would cause fewer references on scientific applications, since an entire floating point value can be obtained in one load request while two requests are necessary on a contemporary machine to gain similar accuracy. Assuming a cache line size greater than four bytes, the request for the second half of a floating point number is likely to be a cache hit (for contemporary machines). Thus, the smaller number of requests for floating point data by the 6600 could cause its cache to have a comparatively higher miss ratio.

Third, the 6600 has poor utilization of instruction words. Instructions are 15 and 30 bits in length and an instruction cannot cross a word boundary. Thus, in some cases 25 percent of an instruction word is unused. Furthermore, the destination of a branch must be on a word boundary, leaving as much as 75 percent of an instruction word unused. The unused portions of instruction words are filled with NO-OPs (15-bit instructions); no-ops are not included in the trace information. The poor utilization of instruction words could cause a machine code sequence to be longer on the 6600 than the machine code generated for a contemporary machine. Such an anomaly would cause a larger miss ratio since more references would be necessary to load an equivalent code sequence.

Lastly, the 6600 has a separate normalize instruction. This anomaly has contrasting effects. If the normalize is in a loop, the extra instruction will generate a lower miss ratio, provided the loop is held in cache. If the loop cannot be held in cache or if the normalize is used in a serial code sequence, then the extra instruction may generate a higher miss ratio.

Overall, the use of trace data based on programs generated on PEs consisting of 6600s will provide statistics that are slightly more pessimistic than could be expected for a contemporary machine. However, the main purpose of the simulations is to provide relative performance measures so various cache policies can be analysed.

4.2.2.   Cache  Simulator 

The purpose of the cache simulator is to determine the miss ratio and the amount of network traffic generated by various cache organizations. Input to the simulator is a set of parameters that specify the organization of a cache, together with trace data (as described above). The parameters specify the cache size, line size, replacement algorithm, and update policy. The output of the simulator is a set of miss ratios and the number of modified words left in the cache at process termination. Separate miss ratios are calculated for each storage class and reference type (e.g. shared cacheable load miss).
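A minimal trace-driven cache model of this kind might look like the following Python sketch. It is not the thesis's PASCAL simulator: the set-associative organization, the LRU replacement, the allocation of a line on every miss (including store misses), and all names are assumptions made for illustration, and the update policy and the count of modified words are omitted.

```python
from collections import OrderedDict, defaultdict


class CacheSketch:
    """Set-associative cache with LRU replacement (assumed organization)."""

    def __init__(self, size_words, line_words, ways):
        self.line_words = line_words
        self.ways = ways
        self.num_sets = size_words // (line_words * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.refs = defaultdict(int)
        self.misses = defaultdict(int)

    def access(self, address, storage_class, ref_type):
        """Process one trace record; return True on a hit."""
        key = (storage_class, ref_type)
        self.refs[key] += 1
        line = address // self.line_words
        lines = self.sets[line % self.num_sets]
        if line in lines:
            lines.move_to_end(line)    # mark as most recently used
            return True
        self.misses[key] += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[line] = True
        return False

    def miss_ratios(self):
        """Miss ratio per (storage class, reference type) pair."""
        return {key: self.misses[key] / count
                for key, count in self.refs.items()}


if __name__ == "__main__":
    cache = CacheSketch(size_words=8, line_words=2, ways=1)
    for addr in (0, 1, 8, 0):          # made-up private-data load trace
        cache.access(addr, "private", "load")
    print(cache.miss_ratios())
```

Keeping the counters keyed by storage class and reference type mirrors the separate miss ratios described above.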

A simple analytic model is used to calculate the delay of a cache miss. Since the function of the cache simulator is to provide miss ratios, the accuracy of the network delay is not critical to the function of the simulator. Including the delay, however, provides the capability of calculating the approximate average utilization of a PE. Thus, rough comparisons of cache designs can be made without the expense of running the machine simulator.

4.2.3.   Machine  Simulator 

The machine simulator is composed of modules that simulate the various components of the machine model described in chapter 2. The function of the simulator is to determine the average utilization of the PEs and the average queuing delays of the various components. The level of simulation detail of the components varies. The following paragraphs describe each major component, its level of simulation detail, and the input parameters that affect the component.

Since the machine model is an MIMD machine, the simulator was designed to simulate the execution of processes from different process trees. Sibling processes are assumed to have similar characteristics and to be defined by a process template. An input parameter to the simulator specifies the number of different process templates. When the simulator is initialized each PE is randomly assigned a process template; the template defines the characteristics of a process in terms of request frequencies, miss ratio, and amount of data to be flushed at process termination (if a write-back policy is used). PE requests and requests not satisfied by the cache are generated stochastically based on the request frequencies and miss ratios. The address of a request is produced randomly and the addresses are evenly distributed over the MMs.
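The self-driven mode can be pictured with a short Python sketch. It is a hypothetical illustration of stochastic request generation, not the simulator's actual code: the template layout (a dictionary of per-cycle request frequencies and miss ratios), the treatment of request kinds without a miss ratio as always going to memory, and all names are assumptions.

```python
import random


def next_request(template, num_mms, rng):
    """Draw the PE's next action for one PE cycle.

    Returns None if no memory request is issued, (kind, None) if the
    request is satisfied by the cache, or (kind, mm) if it must travel
    to memory module mm.
    """
    draw = rng.random()
    cumulative = 0.0
    for kind, frequency in template["frequency"].items():
        cumulative += frequency
        if draw < cumulative:
            # Kinds with no miss ratio (non-cacheable) always go to memory.
            miss_ratio = template["miss_ratio"].get(kind, 1.0)
            if rng.random() < miss_ratio:
                return (kind, rng.randrange(num_mms))  # uniform over the MMs
            return (kind, None)
    return None


if __name__ == "__main__":
    template = {  # made-up process template
        "frequency": {"ifetch": 0.5, "private_load": 0.2, "fetch_and_phi": 0.01},
        "miss_ratio": {"ifetch": 0.02, "private_load": 0.05},
    }
    rng = random.Random(0)
    print([next_request(template, 16, rng) for _ in range(5)])
```

Passing the random generator in explicitly keeps runs reproducible, which helps when comparing cache configurations.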

The simulator also provides the capability of having a set of the processes be trace-driven. In this case a PE's requests are specified by trace data (as described above). Since the references of only one PE were recorded on the trace tape, the actual memory addresses in the trace data are used only for instruction references; addresses for private and shared data requests are produced randomly and are evenly distributed over the MMs. (Since the program segments that were traced consisted of DOALLs, the instructions accessed by the PEs over several iterations were likely to be the same; however, the data locations were likely to be quite different.) The cache miss ratios provided as part of the process template are used to determine if a request causes a cache miss.

For ease of programming and to simplify statistic gathering, a memory request is not broken into a sequence of packets that are transferred through the network; rather, an array of message structures is allocated and pointers to individual array elements are passed through the network. An array element contains the destination of the message (MM or PE address), the full memory address, the request type, and the size of the message in packets. The message size (in packets) specifies the service time for the message. The size of a message is determined by the type of a request and a set of input parameters: packet size, address length, PE word size, and the amount of MM interleaving.
When a request (message) is to be transmitted through the network, the PNI module obtains a pointer to a free entry in the message array and stores the information about the request into the entry. The PNI delays for $m$ network cycles before servicing another PE request, where $m$ is the message size. When a cache miss occurs (and depending on the update policy), the PNI issues a sequence of messages to fetch the line; the number of messages issued depends on the line size and the MM interleaving granularity. If a read miss occurred, the PNI places the PE into a wait state until the request that caused the miss returns from the MM. When the PNI issues a cache line fetch, the PNI issues the request for the data that caused the cache miss first; other portions of the cache line are fetched in address order.
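The ordering of the messages in a line fetch can be sketched as below. The Python function is a hypothetical illustration: it assumes that lines are aligned to the line size, that the line size is a multiple of the interleaving granularity, and that each message is identified by the base address of the portion it fetches; none of these details are stated in the text.

```python
def line_fetch_messages(miss_address, line_size, granularity):
    """Return the base addresses of the messages issued for a line fetch.

    The portion containing the missed datum is requested first; the
    remaining portions of the line follow in address order.
    """
    line_start = (miss_address // line_size) * line_size
    first = (miss_address // granularity) * granularity
    rest = [base
            for base in range(line_start, line_start + line_size, granularity)
            if base != first]
    return [first] + rest


if __name__ == "__main__":
    # Made-up example: 8-word line, interleaving granularity of 2 words.
    print(line_fetch_messages(miss_address=13, line_size=8, granularity=2))
```

The length of the returned list equals the line size divided by the interleaving granularity, which matches the number of messages per miss given by eq. (4.8a).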

When  a  request  returns  from  the  network  the  PNI  checks  if  the  PE  is  waiting 
for  the  address;  if  the  PE  was  waiting,  the  PNI  places  the  PE  into  a  run  state.  The 
entry  for  the  message  is  then  inserted  into  the  pool  of  free  entries. 
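The message array and its pool of free entries might look like the sketch below. The class and field names are assumptions chosen for illustration; only the fields themselves (destination, full address, request type, size in packets) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Message:
    dest: int = 0        # MM or PE address
    address: int = 0     # full memory address
    req_type: str = ""   # e.g. "load" or "store" (labels are hypothetical)
    size: int = 0        # message size in packets (also its service time)


class MessagePool:
    """Fixed array of messages; indices stand in for the pointers."""

    def __init__(self, capacity):
        self.entries = [Message() for _ in range(capacity)]
        self.free = list(range(capacity))

    def allocate(self, dest, address, req_type, size):
        index = self.free.pop()          # raises IndexError if exhausted
        self.entries[index] = Message(dest, address, req_type, size)
        return index

    def release(self, index):
        self.free.append(index)          # entry returns to the free pool
```

Passing an index through the simulated network instead of copying packets keeps the per-cycle work constant regardless of message length.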

The interconnection network used in the simulation model is topologically
equivalent to the Omega network of Lawrie [75]*. The stages of the network from
the PNIs to the MMs are connected by an inverse shuffle (Stone [71]). An input
parameter specifies the number of switch input and output ports. A switch's output
ports are simulated before the input ports to prevent a message from being received
from a previous stage and transmitted to the next stage in one cycle. The output
ports are serviced in a round-robin fashion. If the input port of the next stage has
its "clear-to-send" (CTS) signal on, the output queue of the port is examined; if the
queue is non-empty, the message (i.e. the pointer) at the head of the queue is
transmitted to the next stage. The output port will not service another message for
m cycles (recall that m is the message size). A message does not have to be fully
resident in the output queue before the first packet of the message can be routed to
the next stage, i.e. the switch is pipelined.

* This network has the same topology as a rectangular SW banyan network with S=F=2 (Goke
and Lipovsky [73]). Wu and Feng [78] show that many of the networks proposed for parallel
processors are topologically equivalent.

The input ports are serviced in a round-robin fashion. When a message is
received at an input port it is routed to the output port queue specified by the
message's MM or PE destination address. If the queue is full, the input port turns
off its CTS signal to the previous stage. The CTS signal is turned back on when the
needed output buffer contains a free position.
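The output-port rule (send only when the next stage is clear to send, then stay busy for m cycles) can be sketched as below. This is a simplified model under stated assumptions: the class and method names are hypothetical, a message is represented as a (name, size) pair, and round-robin selection among ports is omitted.

```python
from collections import deque


class OutputPort:
    def __init__(self, capacity):
        self.queue = deque()      # pointers to queued messages
        self.capacity = capacity
        self.busy = 0             # cycles left before the next send

    def clear_to_send(self):
        # CTS as seen by the previous stage: on while a free position exists.
        return len(self.queue) < self.capacity

    def cycle(self, next_stage_cts):
        """Advance one network cycle; return the message sent, or None."""
        if self.busy > 0:
            self.busy -= 1
            return None
        if next_stage_cts and self.queue:
            message = self.queue.popleft()
            self.busy = message[1] - 1   # port is occupied for m cycles
            return message
        return None
```

For example, with messages of sizes 3 and 1 queued and CTS held on, the sends occur in cycles 0 and 3.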

One of the MNI's functions is to queue messages that have been received from the
network and are waiting for service by the MM. A message must be held in the MNI
queue for m cycles while it is being assembled before it is passed to the MM for
service. The other function of the MNI is to queue and transmit back to the PEs
messages that have been serviced by the MM. When the MNI receives a message that
has been serviced by the MM it passes the message to the network; the MNI
services the message for m cycles. If the MNI is busy when the MM finishes service,
the new message is queued.

The MMs are simulated as a delay (no memory function is actually performed).
The length of the delay is a function of the MM cycle time (an input parameter)
and the request type. The MM cycle time is independent of the interleaving
granularity, i.e. for a fixed MM cycle time a fetch of 8 bytes requires the same
number of network cycles as a fetch of 16 bytes. The module that simulates the
MMs changes the message length once the message has been serviced (e.g. a load
request arrives as a message consisting of only an address, but departs consisting of
an address and data).


CHAPTER   5 


RESULTS 


Presented in this chapter are the results of experiments that studied the effects
of various cache parameters and policies on the performance of a parallel system.
The results were obtained from the analytic models derived in section 4.1 and from
simulations (in both trace-driven and self-driven modes). The cache parameters and
policies studied were central memory interleaving, cache line size, and update
policies. Meaningful results about mapping algorithms could not be obtained because of
the nature of the trace data: the address space was limited, and the WASHCLOTH
environment does not provide a means for a process to inherit a parent process's
address space.

Four parallelized numeric programs were used to obtain trace data. Each
program was derived from a serial version by modifying it by hand to
obtain as much parallelism as possible. The parallelized programs were run in the
WASHCLOTH environment and the address references were recorded. The
programs used were: SIMPL (a two-dimensional hydrodynamic code), a
three-dimensional weather forecasting program (Korn and Rushfield [83]), a multigrid
Poisson PDE solver, and a polymer simulation (Bishop [83]). The trace data was
used to drive the cache simulator.

The machine simulator was used predominantly in self-driven mode. The
simulated machine was configured to have 128 PEs and an Omega network with an
inverse shuffle connection from the PEs to the MMs. For clarity of discussion in
the following sections, the physical network is assumed to be composed of two
logical networks: the request network and the response network. The request network
is used to route requests from the PEs to the MMs. The response network routes
responses (i.e. requests that have been serviced by a MM) from the MMs to the
PEs. For each experiment the simulator was run for 30,000 network cycles;
however, during the first 10,000 cycles (start-up time) the waiting times were not
recorded. The start-up time allowed the machine to reach steady state.

Requests and responses are assumed to have a header consisting of four bytes.
The header contains a MM (PE) number, memory offset, and memory operation.
In the case of a load request, the request consists of only the header; likewise, a
store response consists only of a header. A store request and a load response
consist of the header plus data.

5.1.   Central  Memory  Interleaving 

In section 3.3 two different memory organizations, with respect to a cache line,
are discussed. The first method is to store a full cache line in one MM, i.e. to
interleave memory on a cache line basis. The second is to distribute a cache line over
two or more MMs, decreasing the interleaving granularity by decreasing the number
of consecutive bytes stored in a single MM. The effects of the different interleaving
methods on system performance are examined in this section. The effects of
various interleaving granularities are initially examined using the analytic models,
observing the effects on the waiting times of the queuing model queues.
Subsequently, simulation results are presented that demonstrate graphically the effects of
interleaving on system performance. In the following discussion all cache
parameters are assumed to be constant with respect to the interleaving granularity unless
otherwise stated.

As  stated  in  section  4.1.1,  a  cache  line  fetch  arrives  at  the  PNI  as  a  single 
request.  Depending  on  the  interleaving  granularity  the  PNI  will  generate  one  or 
more  requests  for  the  miss.  Thus,  the  queuing  delay  at  the  PNI  is  minimized  when 
interleaving  is  done  on  a  cache  line  basis.  As  the  granularity  of  the  interleaving 
decreases,  the  waiting  time  at  the  PNI  will  increase  because  the  service  time  for  a 
cache  miss  increases. 

The queuing delays in the request network and the MNIin are also minimized
when interleaving is on a cache line basis. These delays are minimized because the
arrival rate of messages, p, given in Eq. (4.8), is minimized (φ is equal to one). As
the granularity of interleaving decreases, the arrival rate increases, since the PNI
generates more requests to fetch a cache line.

Interleaving affects the MNIout and the response network in the opposite way
from the MNIin and the request network. If interleaving is on a cache line
basis, the message size of a response is large. Eqs. (4.13) and (4.17) show that the
average delays for these queues are more sensitive to the message size than to the
probability of an arrival. Thus, decreasing the interleaving granularity would
decrease the waiting times in these queues.

The above discussion indicates that the interleaving granularity is a trade-off
between the message arrival rate and the average message size, and a trade-off
between the penalty a PNI pays for initiating a line fetch and the amount of
interference the fetch generates on requests from other PNIs (PEs). Large grain
interleaving (interleaving on a cache line basis) decreases PNI and MM overhead,
but increases response network interference. Decreasing the interleaving
granularity increases PNI and MM overhead, but decreases response network interference.
Results of a simulation study of the memory interleaving trade-offs are given below.
In these simulations the interleaving granularity was varied for a given line size.
All other system parameters, with the exception of those in Table 5.1, were held
constant. The results were generated using the second level simulator in
self-driven mode. For the experiments the PE request mix was 60% instructions, 30%
loads, and 10% stores. A store-through policy was used. A store-through policy
provides a level of "background" traffic, i.e. traffic other than that produced by
cache misses.

The graphs in Figures 5.1 - 5.6 show the average PE utilization vs. interleaving
granularity. The interleaving granularity is given as the number of consecutive
bytes stored in a single MM (the granularity increases from left to right). Different
cache line sizes are shown on a single graph: A is 16 bytes, B is 32 bytes, and C is
64 bytes. Table 5.1 indicates the parameters that were varied and their values. The left
column gives the experiment numbers, which correspond to the numbers on the
graphs. The packet size is the number of bytes that can be transferred between two
network switches within one network cycle. A message size, m, is equal to the
request or response size divided by the packet size. For example, if the packet size
is two, a store request (which consists of eight bytes) would have a message size of
four packets. The PE cycle time, C_PE, and MM cycle time, C_MM, are given in terms
of network cycles. For example, a PE cycle time of four implies that a PE can issue
a request every four network cycles; a MM cycle time of four implies that a MM can
service a memory request every four network cycles.
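The message size arithmetic can be written out as a small sketch. The four-byte header and the rule for which messages carry data are taken from the text; the function name, the request-kind labels, and rounding a partial packet up to a whole packet are assumptions made for illustration.

```python
import math

HEADER_BYTES = 4  # MM (PE) number, memory offset, memory operation


def message_size(kind, data_bytes, packet_bytes):
    """Message size m in packets for one request or response."""
    # Load requests and store responses consist of the header only.
    carries_data = kind in ("store_request", "load_response")
    total = HEADER_BYTES + (data_bytes if carries_data else 0)
    return math.ceil(total / packet_bytes)  # rounding up is an assumption


# The example from the text: an eight-byte store request, two-byte packets.
print(message_size("store_request", 4, 2))
```

The same function gives a five-packet load response for a 16-byte interleaving unit when the packet size is four.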


experiment   packet     PE       MM          miss ratio
  number      size     cycle    cycle    instruction   data

     1          4        2        4          .01        .10
     2          4        2        4          .02        .20
     3          4        4        4          .02        .20
     4          4        4        8          .02        .20
     5          2        4        4          .02        .20
     6          2        4        8          .02        .20

Table 5.1. Interleave experiment parameters and values

The graphs indicate that interleaving memory on a cache line basis is not a
good memory organization. Such an organization results, in general, in the
poorest average PE utilization (for the experiments shown). This poor
utilization is due to large queuing delays at the MNIout queues and at the first few
stages of the response network, as was indicated by the analytic models. Table 5.2
gives the average MNIout queue waiting times for experiment 5. Note that the
waiting times increase as the interleaving granularity increases.


 line                    granularity
 size        4        8        16       32       64

  16        0.0     .250     .484
  32        0.0     .636    1.123    1.757
  64        0.0    1.857    2.813    4.044    5.722

Table 5.2. Average waiting time for MNIout

For finer grains of interleaving the major queuing delays shift to the MNIin
and the latter stages of the response network, i.e. those stages closest to the PEs.
The large delays at the MNIin queue are due to the increased arrival rate (i.e. more
requests are generated to fetch a cache line). Table 5.3 gives the average MNIin
queue waiting times for experiment 5. (Note that the delays decrease from left to right,
whereas the delays in Table 5.2 increase from left to right.)


 line                    granularity
 size        4        8        16       32       64

  16       .479     .223     .112
  32      1.135     .444     .203     .097
  64      3.509     .925     .370     .164     .073

Table 5.3. Average waiting time for MNIin

The large queuing delays in the response network's switches close to the PEs
are due to the merging of responses from a cache line fetch. The network topology is
assumed to be an Omega network with an inverse shuffle connection from the PEs
to the MMs. Figure 5.7 shows such a network. The dashed lines indicate the paths
that requests would take for a single cache line fetch issued by a single PE if a cache line
were interleaved across four MMs. The requests are issued serially from a PE
(PNI), then branch out in a tree-like fashion during the first two stages of the
request network. In general, if a cache line is fetched using n requests, the requests
will branch out in the first log n stages of the request network. After the first log n
stages the requests will traverse different paths to the MMs. Responses will
likewise traverse different paths until the last log n stages of the response network. In
these stages the responses will "merge" and follow the same path. Therefore, the
queuing delays in these switches become large, since the responses are routed to the
same sink (PE) and must be serviced by the switches serially. Earlier stages do not
have the large delays because the responses traverse different paths. Fetching a
cache line with multiple requests violates the assumption of request independence in
the analytic model. Therefore, the model is not capable of predicting the waiting
times of responses in the latter stages of the network.

Figure 5.7. Inverse shuffle connection network
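The branching behavior can be checked with a small routing sketch. It is an illustration under several assumptions that are not taken from the thesis: 2x2 switches, each stage modeled as a switch followed by an inverse shuffle (a right rotation of the line number), destination bits consumed low-order first, and a cache line interleaved across consecutive MMs.

```python
def route(src, dest, n_bits):
    """Line numbers occupied after each stage on the path src -> dest."""
    mask = (1 << n_bits) - 1
    pos, path = src, []
    for stage in range(n_bits):
        bit = (dest >> stage) & 1           # destination tag, low bit first
        pos = (pos & ~1) | bit              # 2x2 switch picks its output
        pos = ((pos >> 1) | ((pos & 1) << (n_bits - 1))) & mask  # inverse shuffle
        path.append(pos)
    return path


# One PE fetching a line spread over four consecutive MMs (8-port network).
paths = [route(5, mm, 3) for mm in range(4)]
for stage in range(3):
    print(stage + 1, sorted({p[stage] for p in paths}))
```

Under these assumptions the four requests occupy two distinct lines after the first stage and four after the second, which matches the tree-like branching in the first log n stages; responses retrace the paths in reverse and therefore merge in the last stages.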

Though the queuing delays in the latter stages of the response network may be
larger than the delays closer to the MMs, they have a lesser effect on system
performance for two reasons. First, the blockage is localized to a small subset of PEs; if
the blockage is at a MM, a larger number of PEs can be affected. Second, the data
that is blocked is likely to be the prefetched portion of a cache line. One of the
assumptions was that the portion of a cache line that causes the miss would be
issued by the PNI first. The remainder of the line is a prefetch (relying on the
sequential nature of programs).

If a shuffle connection were used between the PEs and the MMs, requests from
a cache line fetch would not branch out until the last log n stages of the request
network. Responses would then traverse most of the response network in a serial
fashion, causing large delays in the switch stages (of the response network) closest to
the MMs.

For the experiments run, the critical waiting times, with respect to system
performance, are associated with the MNIin, the MNIout, and the first three stages of
the response network. Let CCWT, the cumulative critical waiting time, be the sum
of the average waiting times for these queues. Figures 5.8 - 5.13 show the
percentage of total waiting time, TWT, spent in these critical queues for the same
experiments as shown in Figures 5.1 - 5.6. In the majority of cases when the percentage
of CCWT is minimized, optimal performance is obtained. If the first or first two
stages of the response network are used in computing the CCWT rather than the
first three stages, the number of cases where there is a correspondence between the
minimal percentage of CCWT and optimal performance decreases.

[Figures 5.8 - 5.13. Percentage CCWT vs. interleaving.]

Memory cycle time has a noticeable effect on memory interleaving. Consider
two pairs of experiments: 3 and 4, and 5 and 6. For a given pair, the parameters are the
same except for memory cycle time. Experiments 4 and 6 have twice the cycle time
of experiments 3 and 5, respectively. For both pairs of experiments the optimal
interleaving granularity is coarser for the slower memory. The
shift in granularity is due to the sensitivity of the MNIin to the memory cycle time
(see Eq. (4.16)) and to the ability of the MNIout to service responses for several
interleaving granularities more rapidly than the MM can service requests. For a
packet size of 4 and a MM cycle time of 8, the queuing delay at the MNIout will be
zero for interleaving granularities ≤ 16 bytes (a message would be ≤ 5 packets). If
the packet size is two, the queuing delay will be zero for interleaving granularities
≤ 8.
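The limits quoted for the MNIout can be reproduced with a short calculation. The sketch assumes that the MNIout queue stays empty whenever a response of m packets can be sent within one MM cycle time (m no larger than the MM cycle time), that a response is a four-byte header plus one interleaving unit of data, and that partial packets round up; the function name is hypothetical.

```python
def zero_delay_granularities(packet_bytes, mm_cycle,
                             granularities=(4, 8, 16, 32, 64),
                             header_bytes=4):
    """Granularities whose load responses fit within one MM cycle time."""
    fits = []
    for g in granularities:
        m = -(-(header_bytes + g) // packet_bytes)  # ceiling division
        if m <= mm_cycle:
            fits.append(g)
    return fits


print(zero_delay_granularities(packet_bytes=4, mm_cycle=8))
print(zero_delay_granularities(packet_bytes=2, mm_cycle=4))
```

Under this assumption the second call (the configuration of experiment 5) returns only the 4-byte granularity, the one column of Table 5.2 whose waiting times are all zero.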

5.2.   Line  Size  Choice 

As described in section 3.3, the cache line size affects system performance in
conflicting ways. To obtain optimal performance, miss ratio improvements and
the negative impact of increased network traffic must be balanced. Presented in this
section are the results of experiments that investigate line size relationships and how
line size choice affects a parallel processor's performance. The relationships and
effects are initially demonstrated using the analytic models. Results of simulation
experiments show the severity of the effects of line size changes. The experiments
were conducted using the cache and machine simulators. For the results presented,
the cache simulator was run in trace-driven mode, while the machine simulator was
run in self-driven mode. For simplicity and clarity all cache parameters (notably the
interleaving granularity) are held constant in the following discussion unless
otherwise stated.

Eq. (4.1) indicates that a reduction in miss ratio directly affects the PE
utilization by reducing the frequency of requests that the PE must await before continuing
useful computation (this frequency is defined in Eq. (4.4)). A secondary effect is the miss
ratio's influence on t, the average transit time. As indicated in Eq. (4.18), t is a
composite of all the waiting times. A parameter of the waiting times is p, the
network traffic intensity, defined in Eq. (4.8). The miss ratio affects the traffic
intensity in two ways: directly through ω, the frequency of cache misses (defined in Eq.
(4.7a)), and through U_PE, the PE utilization. Although a doubling in line size may
halve the miss ratio, the network traffic may increase because of an increase in
U_PE. If the miss ratio is less than halved, the product φω will increase with the
line size.
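A worked example of the last point, as a sketch only: it assumes that the number of requests generated per line fetch equals the line size divided by the interleaving granularity, and it uses the miss ratio as a stand-in for the miss frequency. Neither definition is quoted from Eqs. (4.7a) or (4.8), and the function name is hypothetical.

```python
def requests_per_reference(miss_ratio, line_bytes, granularity_bytes):
    """Network requests generated per memory reference (illustrative)."""
    requests_per_fetch = max(1, line_bytes // granularity_bytes)
    return requests_per_fetch * miss_ratio


base = requests_per_reference(0.25, 16, 8)            # 16-byte line
halved = requests_per_reference(0.25 * 0.5, 32, 8)    # miss ratio halved
less = requests_per_reference(0.25 * 0.7, 32, 8)      # miss ratio times .7
print(base, halved, less)
```

With the miss ratio exactly halved the product is unchanged; with a smaller improvement it grows, which is the case described in the last sentence above.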

Let the ratio r_l be the miss ratio improvement for a line size of l. For a given
line size l,

    r_l = m_l / m_{l/2},

where m_l is the miss ratio for a line of size l. Let r be the
miss ratio improvement for all l, i.e. if r = .5 then the miss ratio is halved each
time the line size is doubled. Presented in Tables 5.4 and 5.5 are the average
waiting times for the MNIin and MNIout queues, respectively, for different line sizes and
different values of r. Table 5.6 gives the corresponding average PE utilizations. The
values were generated by running the machine simulator in self-driven mode. A
line size of 16 bytes was considered a base; a miss ratio of 2.5% was assumed for
instructions and a miss ratio of 25% was assumed for data. The miss ratios for the
other line sizes in a given row are determined by applying the value of the ratio r for
that given row. For example, if r = .5, then the miss ratio for a line size of 32 bytes
is 12.5%, for a line size of 64 bytes is 6.25%, and for a line size of 128 bytes is
3.125%. The PE request mix was 60% instructions, 30% loads, and 10% stores.
The interleave granularity was 8 bytes. The packet size was 2, the PE cycle time
was 4, and the MM cycle time was 4. A store-through update policy was assumed.
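The way the miss ratios for each row are derived from the base value can be sketched as follows; the function name and defaults are illustrative only.

```python
def miss_ratios(base_ratio, r, base_line=16, max_line=128):
    """Miss ratio per line size, applying r at each doubling of line size."""
    ratios = {}
    line, ratio = base_line, base_ratio
    while line <= max_line:
        ratios[line] = ratio
        line *= 2
        ratio *= r
    return ratios


# Data miss ratio of 25% at 16 bytes with r = .5, as in the example above.
print(miss_ratios(0.25, 0.5))
```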


                   line size
  r        16       32       64       128

 .5       .243     .354     .385     .393
 .6       .243     .379     .590     .682
 .7       .243     .417     .725    1.120
 .8       .243     .436     .815    1.563
 .9       .243     .455     .879    1.743

Table 5.4. Average waiting time for MNIin


                   line size
  r        16       32       64       128

 .5       .276     .428     .458     .467
 .6       .276     .486     .800    1.052
 .7       .276     .550    1.254    2.292
 .8       .276     .600    1.488    4.053
 .9       .276     .643    1.719    5.375

Table 5.5. Average waiting time for MNIout


                   line size
  r        16       32       64       128

 .5       .59      .72      .83      .89
 .6       .59      .75      .75      .80
 .7       .59      .64      .68      .67
 .8       .59      .62      .62      .56
 .9       .59      .69      .56      .45

Table 5.6. Average PE Utilization

When r = .5 the average queuing delays for line sizes of 32, 64, and 128 bytes
are roughly equal. The slight increase in the delays is due to the increase in PE
utilization, as indicated by the analytic models. A large increase in the average
queuing delays exists between line sizes of 16 and 32 bytes. This dramatic increase
is due to the dramatic increase in utilization. As r → 1 the average queuing delays
increase rapidly with changes in line size. For a given line size the average waiting
times increase and the utilization decreases; the waiting times increase even though
the utilization decreases because the φω term in Eq. (4.8) overwhelms changes in
U_PE.

As indicated in the previous section, as the interleaving granularity increases, p
is reduced, since φω is reduced. Shown in Table 5.7 are the MNIin and MNIout
waiting times for a line size of 128 bytes. System parameters are the same as above
except that an interleaving granularity of 16 bytes was used. Note that the MNIin delays
are reduced, but the MNIout delays increase because of the larger message sizes. The
utilizations for the two experiment groups were essentially identical.


                        r
 queue       .5       .6       .7       .8

 MNIin      .181     .306     .550     .595
 MNIout     .874    1.778    3.467    6.616

Table 5.7. Average waiting time for MNIin and MNIout

Figures 5.14 - 5.19 show how system performance is affected by line size for
various system configurations. The results were produced using the machine
simulator in self-driven mode. For these experiments the instruction mix and update
policy were the same as in the experiments described above. The interleave
granularity was 8 bytes for line sizes of 16 - 64 bytes. A granularity of 16 bytes was
used for a line size of 128 bytes. Parameters and values particular to a given
experiment are shown in Table 5.8. In these experiments a line size of 32 bytes was the
base; the assumed miss ratio for instructions was 2% and the miss ratio for data
was 20%. All other miss ratios were calculated using r. Note that the miss ratio
for a line size of 16 bytes increases as r decreases. Different values of r are
indicated on a single graph: r = .7 is A, r = .8 is B, and r = .9 is C.


experiment   packet     PE       MM
  number      size     cycle    cycle

     1          4        2        4
     2          4        4        4
     3          4        4        8
     4          2        2        4
     5          2        4        4
     6          2        4        8

Table 5.8. Line size experiment parameters and values

[Figures 5.14 - 5.19. PE utilization vs. line size. Horizontal axis: line size (2**x bytes).]

The graphs indicate that when the packet size is four, a line size of 128 bytes
produces optimal performance provided r is ≤ .7. When the network bandwidth is
halved, i.e. the packet size is reduced to two, the excess traffic produced by a line
size of 128 bytes has a greater negative effect on system performance than the
reduction in miss ratio has a positive effect. The disparity is due to the
increased number of network packets needed to transmit a message (i.e. a partial
cache line): Eq. (4.13) shows that the switch waiting times are quadratic in the
message size. For system configurations where the packet size is small, a line size of 64
bytes provides optimal performance (when r is ≤ .7). As r → 1 the line size that
produces optimal performance becomes smaller (see Figures 5.2 - 5.6 for PE utilization
when r = 1). The reduction in line size is due to the network's inability to handle the
excess traffic produced by large line fetches when the reduction in miss ratio is
small.

The graphs also indicate that memory cycle time has an effect on line size
choice. When the memory cycle time is large relative to the network cycle time,
large line sizes can be tolerated over a wider range of r. In Figure 5.16, the graph
shows that the system configuration can support a line size of 128 bytes for r = .7
to .8. In Figure 5.19, the system configuration can support a line size of 128 bytes for r
= .7. The difference is due to the ability of the MNIout to clear its
queue before receiving another response from the MM.

Presented below in Table 5.9 are some of the results of experiments that
determined the miss ratios of the computational units comprising the parallel programs
described in the introduction of this chapter. The miss ratios were obtained by running
the cache simulator in trace-driven mode. For these experiments the following cache
parameters were used: 32K byte cache, LRU replacement, set associativity of 8,
and a store-through update policy. The cache line size was varied between 16 bytes
and 128 bytes. The miss ratios are given separately for instructions (I), private loads (P),
and shared cacheable loads (S). Since a store-through policy was used, stores are not
included in the miss ratios.


                                              line size
experiment         16                  32                  64                  128
  number      I     P     S       I     P     S       I     P     S       I     P     S

    1       .002  .002  .064    .001  .001  .040    .001  .001  .025    .001  .001  .017
    2       .002  .004  .321    .001  .003  .173    .001  .002  .099    .001  .001  .062
    3       .002  .009  .534    .001  .006  .534    .001  .004  .534    .001  .003  .307
    4       .002  .003  .582    .001  .002  .448    .001  .002  .386    .001  .001  .367
    5       .002  .003  .763    .001  .002  .726    .001  .002  .721    .001  .001  .740
    6       .005  .030  .386    .003  .018  .359    .002  .012  .316    .001  .009  .286

Table 5.9. Miss ratios

The instruction and private data miss ratios for these experiments are very
small due to chunking. Instead of generating a separate process for each doall
iteration, a single process represents several iterations. Thus, the penalty of loading the
cache is amortized over several iterations, producing a small average miss ratio;
the miss ratio for the first iteration is on the order of a few percent. The shared
cacheable data does not have the same characteristics. In these experiments
references to shared cacheable data do not have a high degree of temporal locality.
Moreover, the chunking is done dynamically, i.e. when a PE finishes an iteration, it
is assigned the next iteration (a fetch-and-add function is applied to the loop
induction variable). Some improvement could be obtained if static chunking were
employed, taking advantage of spatial locality. The amount of improvement would
depend on the data reference patterns.


experiment                          r, by line size (bytes)
  number            32                  64                  128
               I     P     S       I     P     S       I     P     S

     1       .500  .500  .625    1.00  1.00  .625    1.00  1.00  .680
     2       .500  .750  .539    1.00  .666  .572    1.00  .500  .626
     3       .500  .666  1.00    1.00  .500  1.00    1.00  .750  .574
     4       .500  .666  .769    1.00  1.00  .861    1.00  .500  .950
     5       .500  .666  .951    1.00  1.00  .993    1.00  .500  1.02
     6       .600  .600  .930    .666  .666  .880    .500  .750  .905

Table 5.10. Miss ratio improvements

Presented in Table 5.10 are the values of r generated from the data in Table 5.9;
each value is the miss ratio at the given line size divided by the miss ratio at half
that line size. The values appear to have no pattern and vary over a wide range (.5 to
1.02). Such a lack of consistency indicates that line size choice is a compromise, i.e. a
line size must be chosen that will provide optimal performance for the largest class of
programs. The large values for shared cacheable data indicate that this storage class
has a small degree of locality, both temporal and spatial.
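
The derivation of r can be illustrated with a short Python sketch. It recomputes
the shared cacheable column of Table 5.10 from the miss ratios of Table 5.9, taking r
to be the miss ratio at one line size divided by the miss ratio at half that line size.
The names used here (SHARED_MISS, improvement_ratios) are illustrative only, and
small differences from the tabulated values are to be expected because Table 5.9 is
rounded to three places.

```python
# Shared cacheable load miss ratios from Table 5.9, keyed by experiment number.
# The four entries correspond to line sizes of 16, 32, 64, and 128 bytes.
LINE_SIZES = (16, 32, 64, 128)
SHARED_MISS = {
    1: (.064, .040, .025, .017),
    2: (.321, .173, .099, .062),
    3: (.534, .534, .534, .307),
    4: (.582, .448, .386, .367),
    5: (.763, .726, .721, .740),
    6: (.386, .359, .316, .286),
}

def improvement_ratios(miss):
    """Return {line_size: r}, where r = miss(line_size) / miss(line_size / 2)."""
    return {LINE_SIZES[i]: miss[i] / miss[i - 1] for i in range(1, len(miss))}

for exp, miss in SHARED_MISS.items():
    r = improvement_ratios(miss)
    print(exp, {size: round(value, 3) for size, value in r.items()})
```

A value of r above 1, as in experiment 5 at 128 bytes, means that doubling the line
size made the miss ratio worse.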

To demonstrate the effects of line size on system performance, the miss ratios
presented in Table 5.9 and their corresponding PE request mixes were used as
inputs to the machine simulator. Two machine configurations were simulated. The
configurations were identical except for the network bandwidth; one configuration
had a network packet size of four bytes and the other had a network packet size
of two bytes. Tables 5.11a and 5.11b show the average PE utilizations for the
simulation experiments. The other system parameters were set as follows: the
interleaving granularity was 8 bytes for line sizes of 16 to 64 bytes and 16 bytes for a
line size of 128 bytes.


experiment           line size (bytes)
  number        16      32      64      128

     1         .93     .96     .97     .97
     2         .80     .88     .92     .94
     3         .70     .70     .69     .75
     4         .69     .74     .76     .72
     5         .67     .68     .66     .58
     6         .68     .70     .71     .69

Table 5.11a. Average PE Utilization, packet size 4


experiment           line size (bytes)
  number        16      32      64      128

     1         .92     .95     .96     .96
     2         .75     .84     .89     .90
     3         .65     .64     .58     .60
     4         .64     .68     .67     .56
     5         .61     .61     .55     .41
     6         .62     .64     .62     .53

Table 5.11b. Average PE Utilization, packet size 2

The results in Table 5.11a indicate that a line size of 64 bytes provides the best
overall average PE utilization. In only one experiment did a line size of 128 bytes
provide a significant improvement over a line size of 64 bytes. Table 5.11b indicates
that a reduction in network bandwidth causes a shift in the optimal line size to a
smaller line size. For the results shown in Table 5.11b a line size of 32 bytes
provides the best overall average PE utilization.
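
One way to make "best overall" concrete is to average the utilizations of the six
experiments for each line size and select the largest mean. The Python sketch below
does this for Tables 5.11a and 5.11b. The use of an unweighted mean is an assumption
made here for illustration (the text does not state how the overall figure was
judged), and the names are hypothetical.

```python
from statistics import mean

LINE_SIZES = (16, 32, 64, 128)

# Average PE utilizations, one row per experiment (Tables 5.11a and 5.11b).
UTIL_PACKET_4 = [
    (.93, .96, .97, .97),
    (.80, .88, .92, .94),
    (.70, .70, .69, .75),
    (.69, .74, .76, .72),
    (.67, .68, .66, .58),
    (.68, .70, .71, .69),
]
UTIL_PACKET_2 = [
    (.92, .95, .96, .96),
    (.75, .84, .89, .90),
    (.65, .64, .58, .60),
    (.64, .68, .67, .56),
    (.61, .61, .55, .41),
    (.62, .64, .62, .53),
]

def best_line_size(rows):
    """Return the line size with the highest mean utilization over all rows."""
    means = [mean(column) for column in zip(*rows)]
    return LINE_SIZES[means.index(max(means))]

print(best_line_size(UTIL_PACKET_4), best_line_size(UTIL_PACKET_2))
```

Under this assumption the selection agrees with the line sizes named in the text for
the two packet sizes.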


5.3.   Update  Policy  Choice 

Presented in section 3.5 is a discussion of update policies with respect to
parallel processing systems. The discussion indicates that the choice of update policy is a
function of store frequency and store reference patterns. It also indicates that no
single update policy may provide optimal performance; rather, a "combined" update
policy may be required to achieve optimal performance. This section examines the
effects of update policies on system performance. The effects are initially examined
using the analytic models. Subsequently, tables and graphs are used to present the
results from the cache and machine simulators. In the following discussions a
no-store-allocate (NSA) store-miss policy is assumed to be used with store-through and
a store-allocate-fetch (SAF) store-miss policy is assumed to be used with store-in
unless otherwise stated.

Since the cache is assumed to be lockup-free, the update policy does not
directly affect r_w, the frequency of requests the PE must await.* The update policy
does, however, indirectly affect r_w through the load miss ratio: in general, a
store-in policy will provide a lower load miss ratio than a store-through policy.
The major influence of the update policy is on ρ, the traffic intensity. A
store-through policy will contribute to ρ a network-request frequency equal to the
frequency of stores issued by the PE. A store-in cache will contribute a
network-request frequency equal to the frequency of PE stores times the store miss ratio.
Store-in will also produce a burst of network traffic during a context switch due to the
cache flush necessary to update central memory.

* A lockup-free cache does not prevent the PE from issuing requests while store-miss
fetches are outstanding.

In general, a store-through policy will generate more network requests than a
store-in policy; however, a store-in policy could produce the same number of
network requests and generate a greater aggregate number of network packets than
store-through. Let f be the store miss ratio (for both private and shared cacheable
data) and φ be as defined in Eq. (4.8a). Also let the miss ratio for loads be
independent of the update policy. If φf = 1, the number of cache-line fetch requests
generated by store misses (in a cache using store-in) will equal the number of stores
issued by the PE.† Consider a cache line of 32 bytes and an interleaving granularity
of 8 bytes, making φ = 4. If f = .25, the number of network requests generated by
the store misses would equal the number of network stores if a store-through policy
were used. If a line size of 64 bytes is chosen with an interleaving granularity of 8
bytes, making φ = 8, the number of network requests would equal the number of PE
stores if f = .125.
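
The break-even condition can be checked with a small Python sketch. It assumes, as
the two examples in the text suggest, that the number of network requests needed to
fetch a cache line is the line size divided by the interleaving granularity; Eq. (4.8a)
itself is not reproduced here, and the function names are illustrative only.

```python
def requests_per_line_fetch(line_size, granularity):
    """Number of network requests needed to fetch one cache line."""
    if line_size % granularity:
        raise ValueError("line size must be a multiple of the granularity")
    return line_size // granularity

def break_even_store_miss_ratio(line_size, granularity):
    """Store miss ratio at which store-miss fetches equal the PE store count."""
    return 1 / requests_per_line_fetch(line_size, granularity)

def store_in_fetch_requests(stores, store_miss_ratio, line_size, granularity):
    """Network requests generated by store-miss fetches under store-in (SAF)."""
    per_miss = requests_per_line_fetch(line_size, granularity)
    return stores * store_miss_ratio * per_miss

print(break_even_store_miss_ratio(32, 8), break_even_store_miss_ratio(64, 8))
```

For 1000 PE stores, a 32-byte line, an 8-byte granularity, and a store miss ratio of
.25, store_in_fetch_requests returns 1000 requests, the same number that a
store-through cache would send as network stores.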

Let the number of packets needed to transmit the header through the network
be α, let δ be the number of packets needed to transmit a store word, and let ι be
the interleaving granularity in packets (i.e. if the interleaving granularity is 8 bytes
and the network packet size is 2 bytes, then ι = 4). In the request network a
store-through policy generates messages of size α + δ; a store-in policy generates φ
messages of size α for each store miss. For each store request generated by a
store-through cache a response of size α will be generated by an MM and traverse the
response network;‡ the responses to the store-in cache's store-miss fetches will
consist of φ messages of size α + ι packets. If ι > δ and φf = 1, then a store-in cache
will generate more packets. The number of messages generated by the two policies
is the same, but the total number of packets generated by the store-in policy will be
greater. The results from Section 5.1 indicate that ι should be ≥ δ.

† It is assumed that an entire cache line is fetched on a cache miss regardless of the
storage type that caused the miss.

‡ A response to a store must be generated in a parallel processor to ensure sequential
consistency.
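
The packet comparison above can be sketched in Python as follows. The sketch counts
only the traffic described in this paragraph (store requests and their responses for
store-through, store-miss fetch requests and their responses for store-in); flush
traffic and write-backs of evicted dirty lines are deliberately left out. The
parameter names, and the example values of one header packet and one packet per
store word, are assumptions chosen for illustration.

```python
def store_through_packets(stores, header, store_word):
    """Packets for `stores` stores under store-through.

    Each store sends a request of (header + store_word) packets and
    receives a response of `header` packets.
    """
    return stores * ((header + store_word) + header)

def store_in_packets(stores, store_miss_ratio, per_line, header, granule):
    """Packets for store-miss fetches under store-in.

    Each store miss sends `per_line` requests of `header` packets and
    receives `per_line` responses of (header + granule) packets.
    """
    misses = stores * store_miss_ratio
    return misses * per_line * (header + (header + granule))

# 32-byte line, 8-byte granularity, 4-byte packets: 4 requests per line,
# a granule of 2 packets, and (by assumption) 1 header and 1 store-word packet.
print(store_through_packets(1000, 1, 1))
print(store_in_packets(1000, 0.25, 4, 1, 2))
```

With the number of messages held equal, store-in moves more packets whenever the
granule is larger than the store word, and the two totals coincide when they are equal.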

Since a store-in policy provides a smaller load miss ratio than store-through,
φf must be > 1 for the two policies to generate the same number of requests. If
φf ≪ 1, store-in appears to be the better policy since it generates fewer network
requests, thus minimizing the amount of interference stores have on loads. At
process termination, however, the cache must be flushed. The flush traffic may be a
greater hindrance to performance than the store-through traffic because the flush
traffic is bunched, whereas the store traffic generated by store-through is more evenly
distributed.

If a store-allocate-non-fetch (SANF) policy is used with store-in, stores will
generate network traffic only if the allocation of a line evicts a dirty line. The
frequency of cache misses, ω, would only be the sum of the frequencies of instruction
and load misses; Γ_p and Γ_s in Eq. (4.7a) would be equal to zero. Store-in with a
SANF policy would likely provide the same load miss ratio as store-in with a SAF
policy. A SANF policy can also be used with store-through. Such a policy will not
reduce the amount of network store traffic, but will likely reduce the amount of
network load traffic by reducing the load miss ratio.

Machine organization parameters that influence the update policy choice are PE
cycle time, MM cycle time, and the network packet size. If the PE cycle time is
large, the rate at which requests are issued by the PE may be sufficiently low that
the network could absorb the increased number of requests generated by a
store-through policy. Likewise, if the network packet size is large the network may be
able to absorb the store traffic. If the MM cycle time is small, the interference of
network stores on loads may be negligible; if the cycle time is large the increased
number of network stores generated by a store-through policy is likely to decrease
system performance. In the remainder of this section simulation results are used to
examine the effects of update policies.

Tables 5.12 through 5.14 give the results of update policy simulation experiments
using the cache simulator in trace-driven mode. The trace data used consisted of six
processes derived from the four parallel codes described earlier. The results shown
include the instruction, private load, and shared cacheable load miss ratios, the number
of network requests generated, and, in the case of a store-in policy, the amount of
dirty data (in four-byte quantities) remaining in the cache when the processes
terminated. At process termination the dirty data must be copied back to central
memory in order to maintain cache coherence; private data, however, need not be
copied since it can no longer be accessed. For these experiments all system
parameters were held constant except for the update policies. A line size of 32 bytes and
an interleaving granularity of 8 bytes are assumed.

Tables 5.12a and 5.12b give the results for store-in and store-through policies,
i.e. store-in with SAF and store-through with NSA.


                               store-in
experiment        miss ratios          network     dirty
  number        I      P      S       requests     data

     1        .001   .006   .025         660        179
     2        .001   .001   .140        3664        370
     3        .001   .003   .501        1912        355
     4        .001   .001   .448        3816        502
     5        .001   .001   .726        6400        504
     6        .003   .015   .359        2308        125

Table 5.12a. Results of store-in experiments


                             store-through
experiment        miss ratios          network     dirty
  number        I      P      S       requests     data

     1        .001   .007   .055        2688
     2        .001   .003   .173        8192
     3        .001   .006   .534        3352
     4        .001   .002   .448        6029
     5        .001   .002   .726        9066
     6        .003   .018   .359        3438

Table 5.12b. Results of store-through experiments


The tables indicate that a store-through policy does generate a greater number of
network requests than a store-in policy: for the experiments shown, store-through
generates between 1.4 and 4 times as many network requests. The greatest increase is
in the first two experiments. The large increase is due to the small miss ratios
obtained using a store-in policy; the other experiments do not have as great an
increase because of their larger miss ratios. The tables also indicate that a store-in
cache provides a smaller load miss ratio for private data. A smaller load miss ratio for
shared cacheable data was achieved in three of the six experiments shown. For
experiments 4-6, a store-in policy did not provide a miss ratio improvement over
store-through because the shared cacheable data in those applications was typically
read before being written.
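
The range quoted above follows directly from the network request columns of Tables
5.12a and 5.12b. The Python sketch below computes the ratio for each experiment; the
dictionary and function names are illustrative only.

```python
# Network requests per experiment, from Table 5.12a (store-in with SAF)
# and Table 5.12b (store-through with NSA).
STORE_IN = {1: 660, 2: 3664, 3: 1912, 4: 3816, 5: 6400, 6: 2308}
STORE_THROUGH = {1: 2688, 2: 8192, 3: 3352, 4: 6029, 5: 9066, 6: 3438}

def request_ratio(exp):
    """Store-through requests divided by store-in requests for one experiment."""
    return STORE_THROUGH[exp] / STORE_IN[exp]

ratios = {exp: round(request_ratio(exp), 2) for exp in STORE_IN}
print(ratios)
```

The smallest ratio occurs in experiment 5 and the largest in experiment 1.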

Table 5.12a indicates that a substantial amount of dirty data must be copied
back to memory at process termination. Before a PE can begin execution of
another process the dirty data in its cache must be copied back to memory and
acknowledgements must be received from the stores. This flushing of the cache will
produce a flood of traffic and take a substantial number of cycles to complete.

Let η be the amount of dirty data left in a cache at process termination. Let κ
be η divided by the interleaving granularity. Let α, δ, and ι be defined as above.
In the best case, i.e. the dirty data is packed into the fewest possible number of
cache lines, the number of cycles needed to issue the stores to perform the flush is
κ(ι + α). Consider Experiment 5 in Table 5.12a: for an interleaving granularity of 8
bytes (2 words) and a packet size of 4, 756 network cycles are required to issue
the stores necessary to flush the cache. In the worst case, i.e. each word is in a
different cache line (or partial cache line), 1008 network cycles are required to initiate
the store requests. If the packet size is 2, the above cycle values are doubled.
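
The flush estimates for Experiment 5 can be reproduced with the following Python
sketch. Several points are assumptions made here rather than statements from the
text: one packet is issued per network cycle, the header occupies one packet when the
packet size is 4 bytes and two packets when it is 2 bytes (values chosen because they
reproduce the figures quoted above), and the worst case is modelled as one store
message of header plus one store word per dirty word. The function names are
illustrative.

```python
import math

WORD_BYTES = 4  # dirty data is reported in four-byte quantities

def best_case_flush_cycles(dirty_words, granularity_bytes, packet_bytes, header):
    """Dirty data packed into the fewest partial lines: one store per granule."""
    granule_packets = granularity_bytes // packet_bytes
    granules = math.ceil(dirty_words * WORD_BYTES / granularity_bytes)
    return granules * (granule_packets + header)

def worst_case_flush_cycles(dirty_words, packet_bytes, header):
    """Every dirty word in a different partial line: one store per word."""
    word_packets = math.ceil(WORD_BYTES / packet_bytes)
    return dirty_words * (word_packets + header)

# Experiment 5 of Table 5.12a: 504 dirty words, 8-byte interleaving granularity.
print(best_case_flush_cycles(504, 8, 4, header=1))
print(worst_case_flush_cycles(504, 4, header=1))
```

Calling the same functions with a packet size of 2 bytes and a two-packet header
doubles both results, in agreement with the last sentence of the paragraph above.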

Some of the dirty data is private and need not be copied back to memory;
however, as will be shown later, the amount of private dirty data is a very small
percentage of the total dirty data. Note that each PE that executes a process from the
same doall loop will have (approximately) the same amount of dirty data in its
cache at process termination.


                          store-in with SANF
experiment        miss ratios          network     dirty
  number        I      P      S       requests     data

     1        .001   .006   .025         400        179
     2        .001   .001   .140        3108        370
     3        .001   .003   .501        1768        355
     4        .001   .001   .448        3800        502
     5        .001   .001   .726        6380        504
     6        .003   .015   .359        2276        125

Table 5.13a. Results of store-in with SANF


                        store-through with SANF
experiment        miss ratios          network     dirty
  number        I      P      S       requests     data

     1        .001   .006   .025        2536
     2        .001   .001   .140        7640
     3        .001   .003   .501        3208
     4        .001   .001   .448        6013
     5        .001   .001   .726        9046
     6        .003   .015   .359        3410

Table 5.13b. Results of store-through with SANF


Tables 5.13a and 5.13b show the results for a store-in and a store-through cache
using the SANF policy. Comparing Table 5.13a with 5.12a, note that the SANF
policy caused a reduction in the number of network requests in all six experiments.
The reductions are greater in the first three experiments than in the latter three.
This is due to the reading before writing of shared cacheable data in the latter three
applications. Again comparing Tables 5.13a and 5.12a, note that the SANF policy
provided the same load miss ratio as the SAF policy. Also note that the amount of
dirty data is the same; this is expected since a SANF policy does not affect the
amount of modified data, only the number of network requests generated due to PE
stores.

Comparing Table 5.13b with 5.12b, note that the miss ratios provided by the
SANF policy are smaller. In fact, the miss ratios are equal to those produced by the
store-in cache. This is expected since a SANF policy provides a function similar
to that of a store-in cache with SAF, in that modified data is allocated in the cache.
Also, note from Tables 5.13b and 5.12b that there is a reduction in the number of
network requests. This reduction in network requests is a consequence of the
reduction in the miss ratios.


                            combined policy
experiment        miss ratios          network     dirty
  number        I      P      S       requests     data

     1        .001   .006   .055         848         17
     2        .001   .001   .173        4108         37
     3        .001   .003   .534        2344         23
     4        .001   .001   .448        4388         16
     5        .001   .001   .726        6424         18
     6        .003   .015   .359        2358         21

Table 5.14. Combined policy: private store-in, shared store-through

Table 5.14 shows the results of simulation experiments when a "combined"
update policy is used. The combined policy consists of a store-in (SAF) policy for
private data and a store-through (NSA) policy for shared cacheable data.
Comparing Table 5.14 with Table 5.12b, note that the combined policy produces substantially
less traffic than a store-through policy; the amount of reduction from the
store-through policy is between 27 and 68 percent. Comparing Tables 5.14 and 5.12a, the
combined policy does produce more traffic than store-in; however, the percentage
of increase is not great (between .3 and 22 percent). Again comparing Tables 5.14 and
5.12a, note that the combined policy has a clear advantage over store-in because of
the reduction in the amount of dirty data. The only dirty data in the
combined-policy cache is private data and this data does not have to be copied back to central
memory at process termination. Thus, no PE cycles are lost performing a cache
flush (which is necessary when a pure store-in policy is used). Note that the
amount of dirty private data is very small; thus, in Tables 5.12a and 5.13a the
significant percentage of dirty data is shared cacheable data, which must be copied back to
central memory.

Note that the private miss ratios in Table 5.14 are the same as in Table 5.12a
and that the shared cacheable miss ratios are the same as in Table 5.12b. This
result is expected due to the composition of the combined policy (store-in for
private data and store-through for shared cacheable data). A SANF policy can be
used in conjunction with a combined update policy. A combined policy with SANF
would provide shared cacheable miss ratios (for the above experiments) identical to
those in Table 5.13b. The amount of network traffic would also be reduced from
the levels shown in Table 5.14, the reduction being due to the reduced shared
cacheable miss ratios.

In the following discussion results from the machine simulator run in
self-driven mode are used to compare update policies for various system configurations.
In these experiments a line size of 32 bytes is assumed with an interleaving
granularity of 8 bytes. The PE request mix assumed is as follows: 60% instructions, 20%
private loads, 10% private stores, 7% shared cacheable loads, and 3% shared
cacheable stores. The miss ratios for a store-through cache are: 1% for instructions, 10%
for private data, and 20% for shared cacheable data. Since a store-in cache
provides better miss ratios for data, miss ratios of 1% for instructions, S.S^c for
private data and 17% for shared cacheable data are assumed for a store-in cache.
The results do not reflect the impact of the cache flush needed to update the central
memory at process termination for a cache using a store-in policy.

Figures 5.20-5.23 show the results of experiments that examine the effects of
various PE cycle times on update policies. For each of the experiments the MM
cycle time was held constant at four. The PE cycle time was varied between 2 and 8
cycles. Figure 5.20 shows the average PE utilization for a store-in cache,
represented by A, a store-through cache, represented by B, and a combined policy
of store-in for private data and store-through for shared cacheable data, represented
by C. In the experiments the network packet size was four. Figure 5.20 indicates
that a store-in update policy provides the best utilization at all PE speeds.
However, as the PE cycle time increases the percentage of performance improvement
over store-through decreases from S.I've to 3.6% and the improvement over the
combined policy decreases from 2% to .5%. Note that the combined policy provides a
PE utilization very close to that of the store-in policy at all PE cycle times.

Figure 5.21 shows the average PE utilization for experiments identical to those
whose results are shown in Figure 5.20, except the network bandwidth is reduced by
half; instead of a packet size of four these experiments were run with a packet size
of two. Note that the results in Figure 5.21 are very similar to those in Figure 5.20,
except for the lower utilizations due to the reduced network bandwidth. Again, the
store-in policy provides the highest utilizations, but a combined policy provides
utilizations that are very close to those provided by store-in.

The results of a second set of experiments are shown in Figures 5.22 and 5.23.
These experiments are identical to the experiments above except a SANF policy was
used with each of the update policies. The store-through miss ratios were assumed
to be equal to the store-in miss ratios since a SANF policy is used. Figures 5.22 and
5.23 indicate that a store-in policy with a SANF store-miss policy provides the best
utilization for all of the experiments. Note that due to the reduced miss ratios for
store-through the percentage of improvement store-in has over store-through is
reduced. Also, the combined policy with SANF provides utilizations that are almost
identical to those of the store-in policy.

Figures 5.24 and 5.25 give the average PE utilization for a series of
experiments where the MM cycle time is varied. In these experiments the PE cycle time was
held constant at two and the MM cycle time varied between 2 and 8 cycles. A SAF
policy was used with store-in and NSA was used with store-through. A, B, and C
represent the same update policies as above. Experiments whose results are shown
in Figures 5.24 and 5.25 have a network packet size of four and two, respectively.
In examining the figures, note that, as in the previous figures, a store-in policy
provided the best PE utilization. The results show, again, that a combined policy
provides utilizations that are very close to those of a pure store-in policy.


CHAPTER   6


CONCLUSIONS  AND  REMARKS


Presented in the previous chapter are results of many experiments that examine
the effects of three cache parameters (interleaving granularity, line size, and update
policy) on the performance of a parallel processor. The experiments were
conducted using simulators in trace-driven and self-driven modes. The analytic model
in Chapter 4 was also used to examine the effects of the parameters on system
performance. Presented in the following sections are conclusions drawn from the
results of the simulation experiments and from the discussions using the analytic
models. The last section describes topics for continued research in the analysis of
cache memories for parallel processors.

6.1.    Interleaving  Granularity 

Interleaving granularity is not generally considered a cache parameter. The
results in section 5.1, however, show that the granularity has a significant effect on
system performance. Thus, the distribution of a cache line in the central memory is
an important cache parameter for a parallel processor. Practical interleaving
granularities are bounded by the cache line size, as an upper bound, and the largest
operand size of an atomic operation (e.g. a double word store), as the lower bound.

From the results presented in section 5.1 an interleaving granularity of 8 bytes
provides the best overall utilization for line sizes of 32 and 64 bytes. A granularity
of 4 bytes, rather than 8 bytes, provides a slightly better utilization (at most a 1.5%
improvement) for a line size of 16 bytes. Although a performance improvement
was obtained, such a small granularity would not be practical since atomic
operations on double word quantities would require semaphores. Since such atomic
operations are required in many scientific programs the slight gain in performance
might be offset by the cost of implementing the atomic operations using semaphores.

For a line size of 64 bytes an interleaving granularity of 16 bytes provides
utilization close to that of a granularity of 8 bytes (in one case a granularity of 16 bytes
provided the better utilization). Because of this close shadowing, the optimal
granularity for a line size of 128 is expected to be 16 bytes. The reason for the shift to a
larger granularity is the increased number of requests generated to fetch a cache
line of 128 bytes.

For a line size of 32 bytes an interleaving granularity of 8 bytes implies that
four load requests must be generated to fetch a cache line from central memory.
Since the four requests traverse the network independently and are serviced by the
MMs asynchronously, the cache must be able to receive the responses out of
sequence. For a cache organization as shown in Figure 3.1, this implies that the
miss tag buffer must contain four residency bits, a bit for each partial cache line.
Likewise, if a "combined" tag store is used (as described in Section 3.1) four
residency bits are required in the tag store regardless of the update policy.
Although more hardware is required to support interleaving granularities of less than
a cache line, the performance gain is significant.
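
The role of the residency bits can be sketched in Python as follows. The class below
is a hypothetical software model, not a description of the hardware in Figure 3.1: it
keeps one bit per partial cache line and reports when all responses have arrived,
whatever their order.

```python
class MissTagEntry:
    """Tracks which partial cache lines of an outstanding fetch have arrived."""

    def __init__(self, line_size=32, granularity=8):
        self.parts = line_size // granularity  # one residency bit per part
        self.residency = 0                     # bit i set => part i is present

    def receive(self, part):
        """Record the response for partial line `part`, in any order."""
        if not 0 <= part < self.parts:
            raise IndexError("no such partial cache line")
        self.residency |= 1 << part

    def complete(self):
        """True once every partial line has been received."""
        return self.residency == (1 << self.parts) - 1

entry = MissTagEntry()
for part in (2, 0, 3, 1):  # responses arriving out of sequence
    entry.receive(part)
print(entry.complete())
```

A 32-byte line with an 8-byte granularity gives four parts, matching the four
residency bits mentioned in the text.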

Another conclusion that can be drawn from Section 5.1 is that the interconnect
network should be configured as an inverse shuffle from the PEs to the MMs. The
analytic model and simulation results show that bottlenecks near the MMs reduce
performance. An inverse-shuffle network allows requests generated to fetch a
cache line to traverse the network independently, thus minimizing conflicts between
cache line fetches in the network.

6.2.   Line  Size 

Analysis of cache line size is difficult since the line size that provides the best
overall utilization is dependent on the programs used to generate the trace data
that drives the simulation models. To analyze the relationship between line
size, network interference, and system performance, experiments were conducted
assuming a miss ratio improvement, r, as the line size was doubled. The results of
these experiments, Figures 5.14-5.19, show that for a network with a high
bandwidth (packet size of four) a line size of 128 bytes provides the best
performance provided r ≤ .7. The performance gain over a line size of 64 bytes ranges
from 2.7% to 7.5%. If the bandwidth is lower (packet size of two), a line size of
64 bytes provides the best utilization when r < .7; a performance gain of up to 1 ."i^c
over a line size of 128 bytes is obtained. For either network bandwidth, as r → 1 the
line size that provides the best performance becomes smaller. If no miss ratio
improvement is obtained by increasing the line size, i.e. r_i = 1 for all i, a line size
equal to the largest atomic unit will provide the best utilization. The graphs in
Figures 5.14-5.19 indicate that the cache line size that provides the best overall
utilization should be half the previous line size for each reduction of the miss ratio
improvement (i.e. each increase of r) by .1, where the
initial line size for a network with packet size of four is 128 bytes and the initial line
size of a network with packet size of two is 64 bytes.

The results from the trace-driven simulations show that the line size has only a
small effect on the instruction and private data miss ratios: each line size provided
a very small miss ratio due to chunking. The line size had a significant effect on the
shared cacheable miss ratio; however, the effect of line size on the miss ratio differs
for various applications. Some of the results show a significant reduction in the
miss ratio as the line size is doubled; others show a small or negative improvement.

The results of the utilization experiments in Table 5.11b clearly show that for a
network with a small bandwidth a line size of 32 bytes provides the best overall
utilization. The small line size provided the best overall utilization since the miss ratio
improvements were small and the network bandwidth was insufficient to handle
responses consisting of large numbers of packets. For a network where the packet
size is four the choice of line size is not as clear. The results in Table 5.11a show
that a line size of 64 bytes provides the best overall performance. However, a line
size of 64 bytes has a performance gain of only -2.0% to 4.0% over a line size of 32
bytes.

6.3.   Update  policy 

The results of the update policy experiments in Section 5.3 show that the choice
of update policy has a large effect on the amount of network traffic that is
generated; a store-through policy generated between 1.4 and 4 times as many
requests as a store-in policy. The results also show that a store-through update
policy provides the lowest utilization and a store-in policy provides the highest level
of utilization, but leaves a large amount of dirty data in the cache at process
termination. When a SANF policy was used with store-through and store-in, the number
of network requests generated by the update policies was reduced and a performance
improvement was obtained.

The results presented in Section 5.3 indicate that the best overall update policy
for a cache in a parallel system is a combined update policy: one that combines
store-in for private data with store-through for shared cacheable data.
The combined policy provided the best overall performance since it provided
utilizations that were close to store-in and since no PE cycles are lost evicting dirty
data from the cache at process termination. Comparing the performance of store-in with
the combined policy, the results in Section 5.3 show that a store-in policy provided,
at best, a 2% improvement in performance over the combined policy. The amount
of improvement decreased rapidly as the cycle time of the PEs or MMs increased.
At the largest PE cycle time in the experiments the store-in policy provided only a
.5% improvement in performance.
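
The difference between the three update policies can be illustrated with a minimal
sketch. It is not the simulator used in this thesis: it assumes a small fully
associative LRU cache that allocates a line on every miss (including store misses),
and the function name, parameters, and trace format are hypothetical. It only counts
network requests and the dirty lines left at termination.

```python
from collections import OrderedDict


def count_requests(trace, policy, num_lines=4, line_size=32):
    """Count network requests made by a toy fully associative LRU cache.

    trace  : iterable of (op, address, shared); op is "load" or "store"
    policy : "store-through", "store-in", or "combined"
    Returns (network requests, dirty lines left at termination).
    Assumption: every miss fetches the line, whatever the policy.
    """
    cache = OrderedDict()  # line number -> dirty flag, kept in LRU order
    requests = 0
    for op, address, shared in trace:
        line = address // line_size
        # Combined policy: store-through for shared data, store-in for private.
        write_through = policy == "store-through" or (
            policy == "combined" and shared
        )
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            requests += 1  # miss: fetch the line over the network
            if len(cache) >= num_lines:
                _, dirty = cache.popitem(last=False)  # evict the LRU line
                if dirty:
                    requests += 1  # copy the dirty line back to memory
            cache[line] = False
        if op == "store":
            if write_through:
                requests += 1  # the store is sent to central memory
            else:
                cache[line] = True  # the store stays in the cache
    return requests, sum(cache.values())


# Eight private stores to one line, then three dispersed shared stores.
trace = [("store", 0, False)] * 8 + [("store", 1024 * i, True) for i in range(1, 4)]
for policy in ("store-through", "store-in", "combined"):
    print(policy, count_requests(trace, policy))
# store-through (15, 0)
# store-in (4, 4)
# combined (7, 1)
```

In this toy trace the single dirty line left by the combined policy holds private
data, which does not have to be copied back to central memory at process termination.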

A combined update policy is well suited for a parallel processor since the
advantages of both update policies are exploited and the disadvantages are
minimized. Since stores to private data are relatively frequent and since the stores
are commonly to a small address range, a store-in policy will produce the minimal
amount of network traffic. The disadvantage of a store-in policy does not apply
since private data does not have to be copied back to central memory at process
termination. If a context switch is necessary prior to process termination, the amount
of dirty data is very small. Store-through is well suited for shared cacheable stores
since the frequency of stores is low and the addresses referenced are typically
dispersed throughout memory.

6.4.    Topics  for  Further  Research

Chapter three discusses a cache organization that supports multiple outstanding
requests (consisting of both loads and stores). However, the simulation
and analytic models used in this thesis did not address outstanding loads; both
models assumed that a cache would lock up while a load was outstanding. These
models can be extended to allow outstanding loads generated by PE prefetches.
Providing a PE with a prefetch capability could affect the line size choice: since the
PE can initiate prefetches, the benefits provided by long cache lines may be
reduced.

The simulation model also assumed that the network was composed of 2x2
switch elements. A study of how the cache parameters discussed in this thesis are
affected by switch size would be an interesting extension of this thesis. An increase
in the switch size will decrease the number of network stages but increase the waiting
time for each stage: the analytic model indicates that the waiting time of a message
increases with the switch size (see Eq. 4.13).
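
The effect of switch size on the number of stages can be sketched as follows. The
sketch assumes a multistage network in which each stage of k x k switches multiplies
the number of reachable ports by k, so that N PEs need the smallest number of stages s
with k^s >= N. The function name is hypothetical, and the per-stage waiting time of
Eq. 4.13 is not modeled.

```python
def network_stages(num_pes, switch_size):
    """Stages needed to connect num_pes ports with switch_size x switch_size switches.

    Assumes each stage multiplies the number of reachable ports by switch_size.
    Integer arithmetic is used to avoid rounding errors from floating-point logs.
    """
    stages = 0
    reach = 1
    while reach < num_pes:
        reach *= switch_size
        stages += 1
    return stages


for k in (2, 4, 8):
    print(k, network_stages(4096, k))
# 2 12
# 4 6
# 8 4
```

For 4096 PEs, moving from 2x2 to 8x8 switches cuts the stage count from 12 to 4, which
is the reduction that must be weighed against the longer wait at each stage.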

Cache parameters that were not studied in this thesis are the sizes of the miss tag
buffer, miss data buffer, and outstanding request buffer. The sizes of these buffers
depend on the interleaving granularity and update policy. If the PEs have the
ability to prefetch, the size of the buffers needed is likely to increase.